We propose a model of a learning agent whose interaction with the environment is governed by 
a simulation-based projection, which allows the agent to project itself into future situations before 
it takes real action. Projective simulation is based on a random walk through a network of clips, 
which are elementary patches of episodic memory. The network of clips changes dynamically, both 
due to new perceptual input and due to certain compositional principles of the simulation process. 
During simulation, the clips are screened for specific features which trigger factual action of the 
agent. The scheme is different from other, computational, notions of simulation, and it provides 
a new element in an embodied cognitive science approach to intelligent action and learning. Our 
model provides a natural route for generalization to quantum-mechanical operation and connects 
the fields of reinforcement learning and quantum computation. 



I. INTRODUCTION 

Computers of various sorts play a role in many pro- 
cesses of modern society. A prominent example is the 
personal computer which has a specific user interface, 
waiting for human input and delivering output in a pre- 
scribed format. Computers also feature in automated 
processes, for example in the production lines of a modern
factory. Here the input/output interface is usually
with other machinery, such as a robot environment in a 
car factory. 

An increasingly important role is played by so-called 
intelligent agents that operate autonomously in more 
complex and changing environments. Examples of such
environments are traffic, remote space, and also the internet.
The design of intelligent agents, specifically for tasks
such as learning [1], has become a unifying agenda of var- 
ious branches of artificial intelligence [2]. Intelligence is 
hereby defined as the capability of the agent to perceive 
and act on its environment in a way that maximizes its 
chances of success. In recent years, the field of embodied 
cognitive sciences [3] has provided a new conceptual and 
empirical framework for the study of intelligence, both in 
biological and in artificial entities. 

A particular manifestation of intelligence is creativity 
and it is therefore natural to ask: To what extent can 
agents or robots show creative behavior? Creativity is 
hereby understood as a distinguished capability of deal- 
ing with unprecedented situations and of relating a given 
situation with other conceivable situations. A similar 
question may arise in behavioral studies with animals, 
and it is related, on a more fundamental level, to the 
problem of free will [4]. 

In this paper, we introduce a scheme of information
processing for intelligent agents which allows for an element
of creative behavior in the above sense. Its central
feature is a projection simulator (PS) which allows
the agent, based on previous experience (and variations
thereof), to project itself into potential future situations.
The PS uses a specific memory system, which we call
episodic & compositional memory (ECM) and which provides
the platform for simulating future action before real
action is taken. The ECM can be described as a stochastic
network of so-called clips, which constitute the elementary
excitations of episodic memory [49]. Projective
simulation consists of a replay of clips representing previous
experience, together with the creation of new clips
under certain variational and compositional principles.
The simulation requires a platform which is detached
from direct motor action and on which fictitious action is
continuously "tested". Learning takes place by a continuous
modification of the network of clips, which occurs
in three distinct ways: (1) adaptive changes of transition
probabilities between existing clips (Bayesian updating);
(2) creation of new clips in the network via new perceptual
input (new clips from new percepts); (3) creation of
new clips from existing ones under certain compositional
principles (new clips through composition).

In modern physics, the notion of simulation and the 
ultimate power of physical systems to simulate other sys- 
tems has become one of the central topics in the field of 
quantum information and computation [5]. A timely ex- 
ample is the universal quantum simulator, which is capa- 
ble of mimicking the time evolution of any other quantum 
system as described by Schrödinger's equation of motion; 
other examples are classical stochastic simulators that 
mimic the time-evolution of some complex process such 
as the weather or the climate. These are all examples of 
dynamic simulators, which simulate (that is, compute) 
the time evolution of a system according to some spec- 
ified law. It is important to note that these notions of 
simulators build on prescribed law, e.g. certain equations 
of motion provided by physical, biological, or ecological 
theory. 

The projection simulator that we discuss in this paper 
- both its classical and its quantum version - is entirely 
different and should be distinguished from these notions 
of simulators. As in standard theory of reinforcement 
learning [1], our notion of projective simulation builds 
entirely on experience (i.e. previously encountered per- 
ceptual input together with the actions of the agent). 
Projective simulation can be seen, in general terms, as 
a continuous feedback scheme of a system (agent) endowed
with some memory, interacting with its environment.
The function of PS is to re-excite fragments of
previous experience (clips) to simulate future action, be- 
fore real action is taken. As part of the simulation pro- 
cess, sequences of fictitious memory will be created by 
a probabilistic excitation process. The contents of these 
fictitious sequences are evaluated and screened for spe- 
cific features, leading to specific action. The episodic 
and compositional memory thereby provides a reflection 
and simulation platform which allows the agent to de- 
tach from primary experience and to project itself into 
conceivable situations [50]. 



II. INTELLIGENT AGENTS 

In the following, we shall discuss the concept of projective
simulation in the framework of intelligent agents [2].
Realizations of intelligent agents could be robots, biolog- 
ical systems, or software packages (internet robots). An 
agent (see Figure 1) has sensors, through which it per- 
ceives its environment, and actuators, through which it 
acts upon the environment. Internally, one may imagine 
that it has access to some kind of computing device, on 
which the agent program is implemented. The function 
of the agent program is to process the perceptual input 
and output the result to the actuators. 



There is a body of literature in the fields of artificial 
intelligence and machine learning, where ideas of learning 
and simulation have been discussed in various contexts 
(for modern textbook introductions, see e.g. [1, 2, 3, 6]). 
The specific notion of episodic memory and its role for 
planning and prediction has been discussed in psychology 
in the 1970s [7, 8] and has since been attracting atten- 
tion in various fields including cognitive neuroscience and 
brain research, reinforcement learning, and even robotics 
[12, 14, 15, 16, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 30]. 
The model which we develop here differs however from 
previous work in essential respects, as will be elaborated 
on below. 

Figure 1: Model of an agent. Adapted and modified from [2]
(see text).



Our model aims at establishing a general framework 
that connects the embodied agent research with funda- 
mental notions of physics. This requires a notion of sim- 
ulation in agents that is both physically grounded and 
sufficiently general in its constitutive concepts. We claim 
that the abstract notion of clips and of projective simula- 
tion as a random walk through the space of clips provides 
such a general framework, which allows for different con- 
crete realizations and implementations. This framework 
also allows us to generalize the model to quantum simu- 
lation, thereby connecting the problem of artificial agent 
design to fundamental concepts in quantum information 
and computation. 

The plan of the article is as follows. In Section II we 
briefly review the standard definition of artificial agents. 
In Section III we introduce and describe in more detail 
the projection simulator and our scheme of a learning 
agent based on episodic & compositional memory. Sec- 
tion IV introduces some formal notation. Section V pro- 
vides illustrations of the main concepts using examples 
of a learning agent in a simple computer game. In Sec- 
tion VI we compare our model of projective simulation 
with some related work in the fields of artificial intelli- 
gence, reinforcement learning, and the cognitive sciences. 
In Section VII we generalize the notion of the projection 
simulator to a quantum mechanical scheme and discuss 
the potential role of quantum information processing for 
artificial agent design. Section VIII concludes the paper. 



For a deterministic agent, a given percept history com- 
pletely determines the next step (actuator motion) of the 
agent. For a stochastic agent, it only determines the 
probabilities with which the agent will perform the pos- 
sible next actuator moves. In the present paper, we shall 
deal with the latter situation. 

The heart of the agent is usually considered to be its 
program. The program will depend on the nature of the 
agent and its environment. It will be different for robots 
that operate in city traffic, on the surface of a planet, or 
inside a human body. The environment usually has its 
own rules that need to be taken into account when de- 
signing the program: it is governed by the laws of physics 
or biology, and it may have limited accessibility, observ- 
ability, and predictability. The role of the program is to 
deal with environmental data (through its sensors) and 
let the agent respond to them in a rational way [2]. 

From a computer-science oriented perspective, it might 
seem as if the problem of intelligent agents were a mere 
software problem, i.e. reducible to algorithmic design. 
From such a point of view, the "intelligence" of the agent is 
imported and its capability to react rationally within its 
environment depends entirely on the designer's ingenuity 
to anticipate all potential situations that the agent may 
encounter, and thus to build corresponding rules into the 
program. However, more recent developments in the area 
of embodied cognitive science [3] have emphasized physi- 
cal aspects of the emergence of intelligence, among them 
the fact that most biological or robotic agents are "embodied"
and "situated", meaning that they acquire information
about their environment, and thereby develop
intelligent behavior, exclusively through physical interactions
(via sensors) with the environment.

In this paper, we will adopt such an embodied ap- 
proach to understanding intelligence [3]. We shall con- 
centrate on a specific aspect of intelligence and inves- 
tigate the possibility of creative behavior in robots or 
agents. In the spirit of the celebrated work of Braiten- 
berg and his vehicles [32], we will propose an explicit 
model of memory, which, together with the idea of pro- 
jective simulation, can give rise to a well-defined notion 
of creative behavior. The description of episodic memory, 
as a dynamic network of clips which grows as the agent 
interacts with the world, is thereby fully embedded in 
the agent architecture. 



III. LEARNING BASED ON PROJECTIVE 
SIMULATION 

In this section, we shall focus on one crucial element of 
the agent architecture, which is its memory, indicated by 
the two connected white boxes in Figure 1. There are various
aspects of memory that enter the discussion
and that should be kept apart. Research in
behavioral neuroscience [33] has shown that learning can 
be related to structural changes on the molecular level of 
a neural network, providing examples of Hebbian learning
[34]. The behavior of simple animals (such as the sea
slug Aplysia [34]) can largely be described by a stimulus-reflex
circuit, where the structure of this circuit changes
over time. In the language of artificial agent research,
this could be modeled as a reflex agent, whose program
is modified over time (which represents the learning of the
animal). In this type of learning, there is a separation
of time scales into "learning" (shaping of the circuit) versus
"reflex" (execution of the circuit). Such a separation is
possible only for simple agents and cannot explain more
complex patterns of behavior.

Phenomenologically speaking, more complex behavior 
seems to arise when an agent is able to "think for a while" 
before it "decides what to do next." This means the 
agent somehow evaluates a given situation in the light 
of previous experience, whereby the type of evaluation 
is different from the execution of a simple reflex circuit. 
An essential step towards such more complex behavior 
seems to be the capability of reinvoking memory with- 
out inducing immediate motor action, which requires a 
separate level of representation and storage of previous 
experience. This type of memory must thus be decoupled
from immediate motor action and cannot, by definition,
be part of a reflex circuit. 

To model intelligent behavior, people have studied artificial
agents of various sorts (utility-based, goal-oriented,
logic-based, planning, ...) [2] whose actions are the result
of some program or set of rules. In so-called learning
agents, the emphasis lies on modeling the emergence of
behavior patterns when no specific rules are specified a
priori, except that the agent remembers in one
way or another that certain percept-action pairs were
rewarded or punished (reinforcement learning).

Here we introduce a learning-type agent, whose decisions
(i.e. "what to do next" in a given situation)
depend not only on its previous experience with similar
situations, but also on fictitious experience which
it is able to generate on its own. The central element 
is a projection simulator (PS), together with a type of 
episodic memory system (ECM), which helps the agent 
to project itself into "conceivable" situations. Triggered 
by perceptual input, the PS calls memory and induces a 
random walk through episodic memory space. This ran- 
dom walk is primarily a replay of past experience asso- 
ciated with the perceptual input, which is evaluated be- 
fore it leads to concrete action. However, memory itself 
is changed dynamically, both due to actual experience 
and due to certain compositional principles of memory 
recall, which may create new content corresponding to 
fictitious experience that never really happened. In this 
model, it is essential to have a representation of the envi- 
ronment in terms of the episodic memory, which enables 
the agent to decouple from immediate connection with 
the environment and reflect upon its future actions. Im- 
portantly, this reflection is not realized as a sophisticated 
computational process, but it can be seen as a structural- 
dynamical feature of memory itself. 

As a physical basis of the PS, one can imagine a neural-network-type
structure, where any primary experience
is accompanied by a certain spatiotemporal excitation 
pattern of the network. The details of this architecture, 
including the way of encoding information, the concise 
learning rules, etc., are not important. The only relevant 
feature is that a later re-excitation with a similar pattern, 
due to whatever cause, will invoke similar experience. 
As the agent learns, it will relate new input with existing 
memory and thereby change the structure of the network. 
The only relevant aspect of the neural-network idea is, for 
our purposes, that any recall of memory is understood as 
a dynamic re-play of an excitation pattern, which gives 
rise to episodic sequences of memory. 

By episodes we mean patches of stored previous ex- 
perience. In the specific context of vision, one could 
also call it a "movie fragment" or "clip". In the follow- 
ing, we will use the terms episode and clip interchange- 
ably. Clips represent basic (but variable) units of mem- 
ory which will be accessed, manipulated, and created by 
the agent. Clips themselves may be composed of more 
basic elements of cognition such as color, shape, or mo- 
tion, but they represent the functional units in our theory 
of memory-driven behavior. 

Formally, episodic memory will be described as a probabilistic
network of clips as illustrated in Figure 2. An excited
clip calls, with certain probabilities, another, neighboring
clip. The neighborhood of clips is defined by the
network structure, and the jump probabilities will be
functions of the percept history. In the simplest version,
only the jump probabilities (weights) change with time,
while the network structure (graph topology) and the clip
content are static. In a refined model, new clips (nodes in
the graph) may be added, and the content of the clips may
be modified (internal dimension of the nodes). A call
of the episodic memory triggers a random walk through
this memory space (network). In this sense, the agent
jumps through the space of clips, invoking patchwork-like
sequences of virtual experience [51]. Action is induced by
screening the clips for specific features. When a certain
feature (or combination of features) is present and above
a certain intensity level, it will trigger motor action.

Figure 2: Model of episodic memory as a network of clips.
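
As a rough illustration of the random walk just described, the following Python sketch treats episodic memory as a dictionary of jump probabilities and hops from clip to clip until a clip with the screened-for feature is reached (here simply: being an action clip). The clip names, the `random_walk` helper, the `max_steps` cutoff, and the small example network are hypothetical choices made for this illustration; they are not part of the model specification.

```python
import random


def random_walk(transitions, start, is_action, rng, max_steps=100):
    """Hop from clip to clip until a clip passes the feature screening.

    transitions: dict mapping a clip to {neighboring clip: jump probability}
    is_action:   predicate standing in for the 'screening for features' step
    """
    clip = start
    path = [start]
    for _ in range(max_steps):
        if is_action(clip):
            return path
        neighbors = list(transitions[clip])
        weights = [transitions[clip][c] for c in neighbors]
        clip = rng.choices(neighbors, weights=weights, k=1)[0]
        path.append(clip)
    return path  # cutoff reached without triggering an action


# Hypothetical network: one percept clip, one intermediate clip, two action clips.
network = {
    "percept": {"memory": 0.5, "act_left": 0.25, "act_right": 0.25},
    "memory": {"act_left": 0.9, "act_right": 0.1},
}
path = random_walk(network, "percept", lambda c: c.startswith("act_"),
                   random.Random(0))
print(path)
```

In this toy network every walk ends in an action clip after one or two jumps; in a larger network the intermediate clips would correspond to the replayed or fictitious experience discussed above.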

In the following sections, we shall put some of these no- 
tions in a more formal framework, and illustrate the idea 
of projective simulation with concrete examples. These 
examples should be understood as illustrations of the un- 
derlying notions and principles. We discuss them in the 
context of simple problems of reinforcement learning, but 
the notion of projective simulation is more general and 
can be seen as a principle and building block for complete 
agent architectures. 



IV. MATHEMATICAL MODELING AND 
NOTATION 

In physical terms, the behavior of an agent (see Figure 
1) can be described as a stochastic process that maps 
input variables (percepts) to output variables (actions). 
An external view of the agent consists in specifying, at 
each time $t$, the conditional probability $P^{(t)}(a|s)$ for action
$a \in A$, given that percept $s \in S$ was encountered.
This is also called the agent's policy in the theory of re- 
inforcement learning [1]. Here, S and A denote the set of 
possible percepts and actuator moves, respectively, which 
we are going to describe in more detail shortly. 

The dependence of this probability distribution on 
time t indicates, for any non-trivial agent, the existence 
of memory [52]. A corresponding internal description 
connects $P^{(t)}(a|s)$ with the memory of the agent and explains
how memory is built up under a given history of
percepts and actions. 

In our model of the agent, memory consists of a network
of episodes (or clips), which are sequences of 'remembered'
percepts and actions. The operation cycle of
an agent can be described as follows: (i) Encounter of
percept $s \in S$, which happens with a certain probability
$P^{(t)}(s)$ [53]. The encounter of percept $s \in S$ triggers
the excitation of memory clip $c \in C$ according to
a fixed "input-coupler" probability function $\mathcal{I}(c|s)$. (ii)
Random walk through memory/clip space $C$, which is
described by conditional probabilities $p^{(t)}(c'|c)$ of calling/exciting
clip $c'$ given that $c$ was excited. (iii) Exit of
memory through activation of action $a$, described by a
fixed "output-coupler" function $\mathcal{O}(a|c)$.
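
To make the relation between the external view (the policy) and the internal description concrete, the sketch below composes the three stages of the cycle for the special case in which the random walk consists of exactly one jump from the clip excited by the percept to the clip that triggers the action. This single-jump restriction, the function name `compose_policy`, and all clip and action labels are assumptions made for the illustration; longer walks would require summing over paths.

```python
def compose_policy(input_coupler, transition, output_coupler):
    """Return P(a|s) = sum over c, c' of O(a|c') * p(c'|c) * I(c|s).

    Only valid under the single-jump assumption stated in the text above.
    Each argument is a nested dict of conditional probabilities.
    """
    policy = {}
    for s, clip_probs in input_coupler.items():
        policy[s] = {}
        for c, p_in in clip_probs.items():
            for c_next, p_jump in transition[c].items():
                for a, p_out in output_coupler[c_next].items():
                    policy[s][a] = policy[s].get(a, 0.0) + p_in * p_jump * p_out
    return policy


# Hypothetical example with deterministic couplers.
input_coupler = {"s": {"clip_s": 1.0}}
transition = {"clip_s": {"clip_a1": 0.75, "clip_a2": 0.25}}
output_coupler = {"clip_a1": {"a1": 1.0}, "clip_a2": {"a2": 1.0}}

policy = compose_policy(input_coupler, transition, output_coupler)
print(policy)
```

With deterministic couplers, as in this example, the policy coincides with the clip transition probabilities; in general the two are only indirectly related.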

In the following, we shall only consider finite agents, 
acting in a finite world. Percepts, actions, and clips are 
then elements of finite-sized sets, according to the follow- 
ing definitions: 

• Percept space:

$s \equiv (s_1, s_2, \ldots, s_N) \in S_1 \times \cdots \times S_N \equiv S$, $s_i = 1, \ldots, |S_i|$.

The structure of the percept space $S$, a Cartesian product
of sets, reflects the compositional (categorical) structure
of percepts (objects). For example, $s_1$ could label the
category of shape, $s_2$ the category of color, $s_3$ the category of
size, etc. The maximum number of distinguishable input
states is given by the product $|S| = |S_1| \cdots |S_N|$.

• Actuator space:

$a \equiv (a_1, a_2, \ldots, a_M) \in A_1 \times \cdots \times A_M \equiv A$, $a_j = 1, \ldots, |A_j|$.

The structure of the actuator space $A$ reflects
the categories (or, in physics terminology, the degrees of
freedom) of the agent's actions. For example, $a_1$ could
label the state of motion, $a_2$ the state of a shutter, $a_3$
the state of a warning signal, etc. All of this depends on
the specification of the agent and the environment. The
maximum number of different possible actions is given
by the product $|A| = |A_1| \cdots |A_M|$.
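
The product structure of the percept and actuator spaces can be made concrete with a few lines of Python. The category values below (two shapes, three colors, two sizes) are invented for the example; only the categories shape, color, and size are taken from the text.

```python
from itertools import product
from math import prod

# Hypothetical percept categories S_1, S_2, S_3.
percept_categories = {
    "shape": ["circle", "square"],
    "color": ["red", "blue", "green"],
    "size": ["small", "large"],
}

# The percept space S is the Cartesian product of the category sets.
percept_space = list(product(*percept_categories.values()))

# |S| = |S_1| * |S_2| * |S_3|
num_percepts = prod(len(values) for values in percept_categories.values())
print(len(percept_space), num_percepts)
```

The same construction applies to the actuator space, with one list of values per degree of freedom of the agent's actions.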

Clips or episodes are elementary, short-time, dynamic 
processes in the agent's memory that relate to past ex- 
perience and that can be triggered by similar experience. 
A clip can be seen as a sequence of remembered (real 
or fictitious) percepts and actions. We distinguish a percept
$s \in S$ that is directly caused by the environment at
a given time $t$ from a remembered (or fictitious) percept
$\mu(s) \in \mu(S)$ that has a certain representation in the
agent's memory system. Similarly, we distinguish real actions
$a \in A$ executed by the agent from remembered (or
fictitious) actions $\mu(a) \in \mu(A)$, which can be (re-)called
by the agent without necessarily leading to real action.
Instead of the symbol $\mu(a)$ we will also use ⓐ $\equiv \mu(a)$ for a
remembered action, and analogously ⓢ $\equiv \mu(s)$ for a
remembered percept. The formal definition of a clip then reads
as follows:

• Clip space:

$c \equiv (c^{(1)}, c^{(2)}, \ldots, c^{(L)}) \in C$; $c^{(l)} \in \mu(S) \cup \mu(A)$.

The index $L$ specifies the length of the clip. A simple example
for $L = 2$ is the clip $c = (\mu(s), \mu(a)) =$ (ⓢ, ⓐ), which corresponds
to a simple percept-action pair. Clips of length
$L = 1$ consist of a single remembered percept or action,
respectively. In the subsequent examples, we will mainly
consider probabilistic networks of such simple clips.

Projective simulation is realized as a random walk in
episodic memory, which the agent uses to reinvoke past
experience and to compose fictitious experience before
real action is taken. Learning is achieved by evaluating
past experience, for example by simple reinforcement
learning. In memory, this will lead to a modification of
the transition probabilities between different clips, e.g.
via Bayesian updating. We emphasize, again, that this
kind of evaluation happens entirely within memory
space. If a certain percept-action sequence $s \to a$ was
rewarded at time step $t$, it will typically mean that, in
the subsequent time step $t + 1$, the transition probability
$p^{(t+1)}(a|s)$ between clips ⓢ and ⓐ will be enhanced.
This is only indirectly related to the conditional probability
$P^{(t+1)}(a|s)$ for real action $a$ given percept $s$.

For convenience, and to emphasize the role of fictitious 
experience in episodic memory, we shall also introduce a 
third space which we call 
• Emotion space:

$e \equiv (e_1, e_2, \ldots, e_K) \in E_1 \times \cdots \times E_K \equiv E$, $e_k = 1, \ldots, |E_k|$.

In the simplest case $K = 1$ and $|E_1| = 2$, with
a two-valued emotion state $e_1 = e \in \{$☺, ☹$\}$. Emotional
states are tags, attached to transitions between different
clips in the episodic memory. The state of these tags
can be changed through feedback (e.g. reward) from the
environment. They are internal parameters and should
be distinguished from the reward function itself, which is
defined externally. Informally speaking, emotional states
are remembered rewards for previous actions; they thus
have a status similar to that of the clips.

The reward function $\Lambda$ is a mapping from $S \times A$ to $I \subset \mathbb{R}$
(real numbers), where in most subsequent examples we
consider the case $I = \{0, 1, \lambda\}$. In the simplest case, $\lambda = 1$:
if $\Lambda(s, a) = 1$ then the transition $s \to a$ is rewarded; if
$\Lambda(s, a) = 0$, it is not rewarded. A rewarded (unrewarded)
transition will set certain emotion tags in the episodic
memory to ☺ (☹), as discussed previously. We shall also
consider situations where the externally defined reward
function changes in time, which leads to an adaptation
of the flags in the agent's memory.



V. SIMPLE EXAMPLE: INVASION GAME 

To illustrate some of these concepts, let us consider the 
following simple game, which we call invasion (see Figure 
3). It has two parties, an attacker (A) and a defender (D) 
(the robot/agent). The task of D is to defend a certain 
region against invasion by A. The attacker A can enter 
the region through doors in a wall, which are placed at 
equal distances. The defender D can block a door and 
thereby prevent A from invasion. 

Initially, defender D and attacker A stand face-to-face 
at some door $k$, see Figure 3. Next, the attacker will move
either to the left or to the right, with the intention to pass
through one of the adjacent doors. For simplicity, we may
imagine that A disappears at door $k$ and re-appears some
time $\tau$ later in front of one of the doors $k - 1$ or $k + 1$. The
defender D needs to guess, based on some information
which we will specify shortly, where A will reappear and
move to that door. (We may assume that D moves much
faster than A so that, if its guess is correct, it will arrive
at the next door before A.) If A arrives at an unblocked
door, it counts as a successful passage/invasion. The
task of D is to hold off the attacker for as long (i.e. for as
many moves) as possible. We can define an appropriate
blocking efficiency. If A has successfully invaded, this
particular duel is over, and the robot D will be faced with
a new attacker appearing in front of the door presently
occupied by the robot.

Figure 3: Game invasion. Defender agent D, whose task is to
block the passage against invasion by the attacker A, tries to
guess A's next move from a symbol shown.

Suppose that the attacker A follows a certain strategy,
which is unknown to the robot D, but, before each move,
A shows some symbol that indicates its next move. In
the simplest case, as illustrated in Figure 3, this could
be a simple arrow pointing right, $\Rightarrow$, or left, $\Leftarrow$, indicating
the direction of the subsequent move. It could also
be a whole number, $\pm m$, indicating how far A will move
and in which direction [54]. The meaning of the symbols
is a priori completely unknown to the robot, but
the symbols can be perceived and distinguished by the
robot. The only requirement we impose at the moment
is that the meaning of the symbol stays the same over a
sufficiently long period of time (longer than the learning
time of the robot). Translated into real life, the "symbol"
could be as mundane as the "direction into which the attacker
turns its body" before disappearing (a robot does
not know what this means a priori), it could be an expression
on its face, or some abstract symbol that A uses
to communicate with subsequent invaders. The described
setup is reminiscent of certain behavior experiments with
Drosophila, using a torsion-based flight simulator system
and a reinforcement mechanism to train Drosophila to
avoid objects in its visual field [35, 36]. In this sense,
the presented analysis may also be interesting for the interpretation
of behavior experiments with Drosophila or
similar species.

Using this simple game, we want to illustrate in the 
following how the robot can learn, i.e. increase its block- 
ing efficiency by projective simulation. We will consider 
different levels of sophistication of the simulation process 
(recovering simple reinforcement learning and associative 
learning as special cases). 

Put into the language introduced in the previous section,
we consider a percept space that comprises two categories

• Symbol shown by attacker: $\{\Leftarrow, \Rightarrow\} = S_1$,

• Color of symbol: $\{\text{red}, \text{blue}\} = S_2$,

while the actuator space comprises a single category

• Movement of defender: $\{-, +\} = A$,

as does the emotion space

• Emoticons: {☺, ☹} $= E$.

In memory space, the circled symbols (ⓢ, ⓐ, etc.) correspond
to memorized percepts/actions that have been
perceived/executed by the agent. In the following, we regard
ⓢ and ⓐ as separate clips of length $L = 1$. The
role of the emotional tags is to indicate, at a given time,
which of the transitions in clip space have recently led to
a rewarded action.

For the reward function $\Lambda : S \times A \to \{0, 1, \lambda\}$, we
often consider the simplest case $\lambda = 1$ (except where
explicitly indicated). For $\Lambda(s, a) = 1$ (0) the transition
$s \to a$ is rewarded (not rewarded). A rewarded transition,
$\Lambda(s, a) = 1$, will set certain emotion tags in the
episodic memory to ☺, which will influence the simulation
dynamics. We shall also consider situations where
the attacker changes its strategy in time, which leads to
a time-dependent reward function and a corresponding
adaptation of the flags in the agent's memory.
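
The spaces and the reward function of the invasion game are small enough to write out explicitly. The sketch below assumes the static attacker strategy considered in this section, identifies the defender move `-` with "one door to the left" and `+` with "one door to the right", and ignores the color category $S_2$. The names `ANNOUNCED_MOVE`, `reward`, and `update_tag`, and the use of the strings `":)"` and `":("` for the two emotion states, are illustrative choices only.

```python
SYMBOLS = ("<=", "=>")   # percept category S_1 (symbol shown by attacker)
MOVES = ("-", "+")       # actuator space A (movement of defender)

# Assumed static strategy: "<=" announces a move to the left ("-"),
# "=>" announces a move to the right ("+").
ANNOUNCED_MOVE = {"<=": "-", "=>": "+"}


def reward(symbol, move):
    """Reward function Lambda(s, a) with values in {0, 1}."""
    return 1 if ANNOUNCED_MOVE[symbol] == move else 0


def update_tag(tags, symbol, move):
    """Record the outcome of the transition symbol -> move as an emotion tag."""
    tags[(symbol, move)] = ":)" if reward(symbol, move) == 1 else ":("
    return tags


tags = {}
for s in SYMBOLS:
    for a in MOVES:
        update_tag(tags, s, a)
print(tags)
```

If the attacker changed its strategy, only `ANNOUNCED_MOVE` would change, and the tags would be overwritten as new feedback arrives.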

The conditional probability that a running (or active)
clip $\mu(\Leftarrow)$ calls clip $\mu(-)$ will be denoted by $p^{(n)}(-|\Leftarrow)$, where
the upper index $n$ indicates the time step ("experience of
the agent"), i.e. how many encounters with an attacker
have occurred.

Suppose that the attacker indicates with the symbols 
$\Leftarrow$, $\Rightarrow$ that it will move one door to the left, or to the 
right, respectively. Then, the episodic memory that will 
be built up by the agent has the graph structure as shown 
in Figure 4. 




Figure 4: Episodic memory that is built up by the defender-agent
in Figure 3, if the attacker follows the static strategy
to move one door to the left (right) after showing the symbol
$\Leftarrow$ ($\Rightarrow$). The "emotion tags" at each of the transitions in the
network indicate the associated feedback that is stored in the
memory's evaluation system. Informally, emotion tags can be
seen as remembered rewards for previous actions. They help
the agent to evaluate the result of a simulation and to translate
it into real action. If a clip transition in the simulation
leads subsequently to a rewarded action, the state of its tag
is set (or confirmed) to ☺, and the transition probability in
the next simulation is amplified. Otherwise the tag is set to
☹ and the transition probability is attenuated (or simply not
amplified).



A. Projective simulation & learning without 
composition 



As we have mentioned earlier, the interaction of the
agent with the environment goes in cycles. In our simple
example, the description of the $n$th cycle (or time step) is
as follows: First, the agent perceives a percept $s$, which
induces the excitation of the percept clip ⓢ. Here we assume
that this excitation happens with unit probability,
which corresponds to a simple choice for the input coupler
function $\mathcal{I}(c|s) = \delta(c - \mu(s))$ introduced in Section IV.
The excited percept clip ⓢ then triggers the excitation of
action clip ⓐ $\in \{\mu(-), \mu(+)\}$ with probability $p^{(n)}(a|s)$. This
can happen either in direct sequence, or after some other
memory clips have been excited in between, as will be
described in the following section. The excitation of an
actuator clip ⓐ usually leads to immediate (real) motor
action $a$, corresponding to a simple choice for the output
coupler $\mathcal{O}(a|c) = \delta(c - \mu(a))$ of Section IV. But we
will also consider different scenarios where the translation
into motor action may be delayed and depend itself
on the emotional tag of the transition ⓢ → ⓐ, resulting
from a reward or penalty of that transition in previous
cycles. After motor action $a$ has been taken, it will either
be rewarded or not. The result of this evaluation
will then be fed back into the state of the episodic memory,
leading to an update of the transition probabilities
$p^{(n+1)}(a|s)$ for the next cycle and of the emotion state
tagged to this transition. This completes the description
of the $n$th cycle.

To provide a complete description of the episodic memory
we now need to specify the update rules, i.e. how a
positive or negative reward ($\lambda = 1$ or $0$) changes the
transition probability between the associated clips. There
are many choices possible. In the following, we choose a 
simple frequency rule, somewhat reminiscent of Hebbian 
learning in neural network theories, but we emphasize 
that other rules are equally suitable [55]. 

We assume that, under positive feedback, the conditional
probabilities $p^{(n)}(a|s)$, with $a \in \{-, +\}$, $s \in \{⇐, ⇒\}$,
grow in proportion with the number of previous rewards
following the clip transition ⓢ → ⓐ. This means
that, if, in time step n, the agent takes the rewarded
action a after having perceived percept s, this will increase
the probability that, in subsequent time step n + 1, an
excited percept clip ⓢ will excite an actuator clip ⓐ. In
other words, this will increase the probability that, after
perceiving the percept s next time, the agent will
simulate the correct action a. Depending on the details of how
the simulation is translated into real action, this will
typically also increase the probability that the agent executes
the rewarded action. Note, however, that the distinction
between simulated action and real action is an essential
point and will give the agent more flexibility.

Quantitatively, we define the transition probability
$p^{(n)}(a|s)$ in terms of a weight matrix h:

$$p^{(n)}(a|s) = \frac{h^{(n)}(s,a)}{h^{(n)}(s)}, \tag{1}$$

where $h^{(n)}(s)$ is the marginal

$$h^{(n)}(s) = \sum_a h^{(n)}(s,a). \tag{2}$$

The weight matrix is, unless otherwise specified,
initialized as

$$h^{(1)}(s,a) = 1 \quad \forall\, s, a, \tag{3}$$

so that the conditional probability distributions
$\{p^{(1)}(a|s)\}_a$ are uniform for all s.

The stepwise evolution of $p^{(n)}(a|s)$, as a function of n,
is stochastic and may, for a given agent, depend on the
entire history of percepts and the actions taken by the
agent. Suppose that, in time step n, the agent perceives
symbol $s^{(n)}$ and then executes action $a^{(n)}$. There are two
possible cases which we need to distinguish.
Case (1): $\Lambda(s^{(n)}, a^{(n)}) = 1$, i.e. the agent did the "right
thing" and the percept-action sequence $(s^{(n)}, a^{(n)})$ is
rewarded. In this case, the weight of the h matrix will
be increased by unity on the transition ⓢ → ⓐ with
$s = s^{(n)}$ and $a = a^{(n)}$, while it stays constant on all other
transitions. To model the possibility that the agent can
also forget, we introduce an overall dissipation factor $\gamma$
($0 \le \gamma \le 1$) that drives the weights $h^{(n)}(s,a)$ towards
the equilibrium (uniform) distribution. Put together we
thus have the update rule:

$$h^{(n+1)}(s,a) - h^{(n)}(s,a) = \delta(s, s^{(n)})\,\delta(a, a^{(n)}) - \gamma\,[h^{(n)}(s,a) - 1]. \tag{4}$$

Case (2): $\Lambda(s^{(n)}, a^{(n)}) = 0$, i.e. the agent did the "wrong
thing" and the percept-action sequence $(s^{(n)}, a^{(n)})$ is not
rewarded. In this case, all weights of the h-matrix are
simply damped towards their equilibrium value:

$$h^{(n+1)}(s,a) - h^{(n)}(s,a) = -\gamma\,[h^{(n)}(s,a) - 1]. \tag{5}$$

The two cases can be combined into a single formula

$$h^{(n+1)}(s,a) - h^{(n)}(s,a) = -\gamma\,[h^{(n)}(s,a) - 1] + \lambda\,\delta(s, s^{(n)})\,\delta(a, a^{(n)}) \tag{6}$$

with $\lambda = \Lambda(s^{(n)}, a^{(n)})$, which also generalizes to a
situation with values of the reward function $\Lambda$ different from
0 and 1.

From the updated weights $h^{(n+1)}(s,a)$, we obtain the
transition probabilities (in clip space) for the next cycle,

$$p^{(n+1)}(a|s) = \frac{h^{(n+1)}(s,a)}{\sum_{a'} h^{(n+1)}(s,a')}. \tag{7}$$
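
As an illustration only, the combined update rule (6) and the normalization (7) can be sketched in a few lines of Python. The array layout (rows indexed by percept, columns by action), the function names `update_weights` and `transition_probabilities`, and the use of NumPy are assumptions made for this sketch; they are not part of the model's specification.

```python
import numpy as np

def update_weights(h, s, a, reward, gamma):
    """One application of Eq. (6).

    h      : 2D array, h[s, a] is the weight of the transition s -> a
    s, a   : indices of the percept and action that occurred in this cycle
    reward : the value lambda (1 for rewarded, 0 for unrewarded)
    gamma  : dissipation rate, between 0 and 1
    """
    h_new = h - gamma * (h - 1.0)  # damping of every entry towards 1
    h_new[s, a] += reward          # gain only on the transition actually used
    return h_new

def transition_probabilities(h):
    """Eqs. (1) and (7): normalize each row of h to obtain p(a|s)."""
    return h / h.sum(axis=1, keepdims=True)

# Two percepts, two actions, uniform initial weights as in Eq. (3).
h = np.ones((2, 2))
h = update_weights(h, s=0, a=1, reward=1.0, gamma=0.0)
p = transition_probabilities(h)
```

After this single rewarded cycle without dissipation, the first row of `p` is (1/3, 2/3) while the second row, whose percept was not shown, remains uniform.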



The updating of the weights from $h^{(n)}(s,a)$ to
$h^{(n+1)}(s,a)$ at the end of cycle n thus depends on which
specific percept-action sequence $(s^{(n)}, a^{(n)})$ has actually
occurred in cycle n. The probability for the latter is
given by the joint probability distribution
$P^{(n)}(s,a) = P^{(n)}(s)\,P^{(n)}(a|s)$ for $(s,a) = (s^{(n)}, a^{(n)})$. While $P^{(n)}(s)$
will be given externally (it is controlled by the
attacker, for example $P^{(n)}(s) = 1/|S|$ for random attacks),
the conditional probability $P^{(n)}(a|s)$ will depend on the
memory, that is, on the weights $h^{(n)}(s,a)$ and how the
simulation is translated into real action.

In the simplest model, the agent has reflection time 1,
which corresponds to the following process. Initially the
percept s activates the percept clip ⓢ. This excites the
actuator clip ⓐ with probability $p^{(n)}(a|s)$. Regardless of
whether the action a was previously rewarded or not, ⓐ
is coupled out, i.e., it is translated into the action a. In
other words, any transition that ends up in a clip
describing some "virtual action" leads to the corresponding real
action. In this case, we obtain

$$P^{(n)}(a|s) = p^{(n)}(a|s) = \frac{h^{(n)}(s,a)}{\sum_{a'} h^{(n)}(s,a')}, \tag{8}$$

which complements the update rules of Eqs. (4) and (5),
together with Eq. (1).

A slightly more sophisticated model is obtained when
the state of the emotion tags (☺ or ☹), which is set by
previous rewards, is used to affirm or inhibit immediate
motor action. In this model, the memory is one step
further detached from immediate action and the agent has a
chance to "reflect" upon its action. To be specific, let us
consider a strategy with reflection time R, which
corresponds to the following process. As in the previous case,
initially the percept s activates the percept clip ⓢ, which
activates the actuator clip ⓐ with probability $p^{(n)}(a|s)$.
However, only if the sequence ⓢ → ⓐ is tagged ☺ (i.e.
it was evaluated $\Lambda(s,a) = 1$ on the last encounter), the
actuator clip ⓐ is "coupled out", i.e. translated into a
real action. If this is not the case (either the
transition was not evaluated before or it was evaluated ☹ [56]),
the percept clip ⓢ is re-excited, which in turn activates
again some actuator clip ⓐ′ (where ⓐ′ and ⓐ may be the
same or different). If the new sequence $(s, a')$ is tagged
☺, ⓐ′ triggers real actuator motion $a'$. Otherwise, the
process is again repeated. For a model with reflection
time R, the maximum number of repetitions is $R - 1$. At
the end of the Rth round, the simulation must exit from
any actuator clip, regardless of its previous evaluations.
We are specifically interested in the success probability
$P^{(n)}(a^*_s|s)$ that the agent chooses a rewarded action $a^*_s$
after a given percept s ($\Lambda(s, a^*_s) = 1$). For reflection time
R, this is given by



$$P^{(n)}(a^*_s|s) = p^{(n)}(a^*_s|s) \sum_{j=0}^{R-1} \left(1 - p^{(n)}(a^*_s|s)\right)^j, \tag{9}$$

which increases with R. Clearly, for larger reflection
times the memory is used more efficiently.
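
The sum in Eq. (9) is a finite geometric series, so it can be evaluated in closed form as $1 - (1 - p)^R$ with $p = p^{(n)}(a^*_s|s)$. The following Python sketch checks this numerically. It assumes, as in the derivation of Eq. (9), that only the rewarded transition carries the tag ☺; the function names are hypothetical.

```python
def success_probability(p, R):
    """Eq. (9): p times the sum over j = 0..R-1 of (1 - p)**j."""
    return p * sum((1.0 - p) ** j for j in range(R))

def success_probability_closed_form(p, R):
    """Closed form of the same geometric series."""
    return 1.0 - (1.0 - p) ** R

# Example: an untrained agent with two actions (p = 1/2).
for R in (1, 2, 3):
    print(R, success_probability(0.5, R))
```

For p = 1/2 the success probability rises from 0.5 at R = 1 to 0.75 at R = 2 and 0.875 at R = 3, which illustrates why a longer reflection time uses the memory more efficiently.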

In our invasion game, the quantity of interest is the
blocking efficiency, $r^{(n)}$, which corresponds to the average
success probability (averaged over different percepts, i.e.
symbols shown by the attacker). After the nth round,
the blocking efficiency is thus given by

$$r^{(n)} = \sum_{s \in S} P^{(n)}(s)\,P^{(n)}(a^*_s|s). \tag{10}$$

In a similar way one can define the learning time $\tau(r_{th})$
for a given strategy as the time it takes on average (over
an ensemble of identical agents) until the blocking
efficiency reaches a certain threshold value $r_{th}$.

In the following, we show numerical results for different
agent specifications. Let us start with agents with
reflection time R = 1. In Figure 5, we plot the learning
curves for different values of the dissipation rate $\gamma$
(forgetfulness). One can see that the blocking efficiency
increases with time and approaches its maximum value
typically exponentially fast in the number of cycles. For
small values of $\gamma$ it approaches the limiting value 1, i.e.
the agent will choose the right action for every shown
percept. For increasing values of $\gamma$, we see that the
maximum achievable blocking efficiency is reduced, since the
agent keeps forgetting part of what it has learnt. At time
step n = 250, the attacker suddenly changes the meaning
of symbols: ⇒ (⇐) now indicates that the attacker is
going to move left (right). Since the agent has already built
up memory, it needs some time to adapt to the new
situation. Here, one can see that forgetfulness can also have
a positive effect. For weak dissipation, the agent needs
longer to unlearn, i.e. to dissipate its memory and adapt
to the new situation. Thus there is a trade-off between
adaptation speed, on one side, and achievable blocking
efficiency, on the other side. Depending on whether
learning speed or achievable efficiency is more important, one
will choose the agent specification accordingly. Note that
for random action, which is obtained by setting $\lambda = 0$ in
(6), the average blocking efficiency is 0.5 (not shown in Figure 5).
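
To make the procedure behind such learning curves concrete, here is a minimal Python sketch of an ensemble simulation for agents with reflection time R = 1. It is not the code used to produce Figure 5: the function name, the default parameter values, the random seed, and the reward table `rewarded` (which simply assigns one rewarded action to each percept) are all assumptions of this sketch. Percepts are drawn uniformly, $P^{(n)}(s) = 1/|S|$.

```python
import numpy as np

def learning_curve(n_steps=100, gamma=0.02, n_agents=200,
                   n_percepts=2, n_actions=2, seed=0):
    """Average blocking efficiency, Eq. (10), for an ensemble of R = 1 agents."""
    rng = np.random.default_rng(seed)
    # Assumed reward table: action rewarded[s] is the rewarded one for percept s.
    rewarded = np.arange(n_percepts) % n_actions
    h = np.ones((n_agents, n_percepts, n_actions))  # Eq. (3) for every agent
    agents = np.arange(n_agents)
    efficiency = np.empty(n_steps)
    for n in range(n_steps):
        p = h / h.sum(axis=2, keepdims=True)        # Eq. (1)
        # Eq. (10) with uniform P(s), averaged over the ensemble.
        efficiency[n] = p[:, np.arange(n_percepts), rewarded].mean()
        s = rng.integers(n_percepts, size=n_agents)  # percept shown to each agent
        # Sample one action per agent from p(a|s) by inverting the cumulative sum.
        cumulative = p[agents, s].cumsum(axis=1)
        a = (rng.random(n_agents)[:, None] > cumulative).sum(axis=1)
        a = np.minimum(a, n_actions - 1)             # guard against rounding
        reward = (a == rewarded[s]).astype(float)    # lambda = 1 or 0
        h -= gamma * (h - 1.0)                       # dissipation term of Eq. (6)
        h[agents, s, a] += reward                    # gain term of Eq. (6)
    return efficiency

curve = learning_curve()
```

The first entry of `curve` equals 1/|A| (here 0.5), the value for random action, and the curve then rises as rewarded transitions accumulate weight. An inversion of the reward table at some time step, as described above, could be added by changing `rewarded` inside the loop.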

Note that the existence of an adaptation period in Fig- 
ure 5 (after time step n = 250) relates to the fact that 
symbols which the agent had already learnt, suddenly in- 
vert their meaning in terms of the reward function. So 
the learnt behavior will, with high probability, lead to un- 
rewarded actions. A different situation is of course given, 
if the agent is confronted with a new symbol that it had 
not perceived before. In Figure 6, we have enlarged the 
percept space and introduced color as an additional per- 
cept category. In terms of the invasion game, this means 
that the attacker can announce its next move by using 
symbols of different shapes and colors. In the first period, 
the symbols seen by the agent have a specific color (red) , 
while at n = 250 the color suddenly changes (blue), and 
the agent has to learn the meaning of the symbols with 
the new color. Note that, unlike Figure 5, there is now 
no inversion of strategies, and thus no increased adapta- 
tion time. The agent simply has never seen blue symbols 
before, and has to learn their meaning from scratch [57]. 

The network behind Figure 6 is the same as in Figure 4,
with the same update rules, but with an extended
percept space (four symbols) and four rewarded transitions.
The agent does not make use of the "similarity"
between symbols with the same shape but with different
colors. This will change in the next section, when we
introduce the idea of composition as another feature of
projective simulation, which will allow us to realize an
elementary example of associative learning.

Figure 5: Learning curves of the defender agent for different
values of the dissipation rate $\gamma$. The blocking efficiency
increases with time and approaches its maximum value
exponentially fast in the number of cycles. For $\gamma = 0$ the blocking
efficiency approaches the limiting value 1, i.e. for each shown
percept the agent will choose the right action. For larger values of $\gamma$,
the maximum achievable blocking efficiency is reduced, since
the agent forgets part of what it has learnt. At time step
n = 250, the meaning of symbols is inverted, i.e. the symbol
⇒ (⇐) now indicates that the attacker is going to move
left (right). Since the agent has already built up memory,
it needs some time to adapt to the new situation. One can
see a trade-off between adaptation speed, on one side, and
achievable blocking efficiency, on the other side. Here, we
have chosen an unbiased training strategy, $P^{(n)}(s) = 1/|S|$. The
curves are averages of the learning curves for an ensemble of
1000 agents. Error bars (indicating 1 standard deviation over
the sample mean) are shown on every fifth data point so as not to
clutter the diagram, which also applies to the error bars in
subsequent Figures.

Let us now come back to the notion of reflection. In 
Figure 7, we compare the performance of agents with dif- 
ferent values of the reflection time R. (Here we consider 
again training with symbols of a single color.) One can 
see that larger values of the reflection time lead to an 
increased learning speed. The reason is that during the 
simulation virtual percept-action sequences are recalled 
together with the associated emotion tags (i.e. remem- 
bered rewards). If the associated tag does not indicate a 
previous reward of the simulated transition, the coupling- 
out of the actuator into motor action is suppressed and 
the simulation goes back to the initial clip. In this sense, 
the agent can "reflect upon" the right action and its
(empirically likely) consequences by means of an iterated
simulation, and is thus more likely to find the right actuator
move before real action takes place [58].

Figure 6: Learning curve for enlarged percept space, with
color as an additional percept category. In the first period,
the symbols seen by the agent have the same color (e.g. red),
while at time step n = 200 the color of the symbols suddenly
changes (e.g. blue), and the agent has to learn the meaning
of the symbol with the new color. Unlike Figure 5, there is no
inversion of strategies, and thus no increased adaptation time.
The agent simply has not seen symbols with the new color
before, and thus has to learn them from scratch. Ensemble
average over 1000 runs with error bars indicating one standard
deviation.

The possibility of reflection can thus significantly
increase the speed of learning, at least as long as the total time
for the simulation does not become too long and start
competing with other, externally given time scales, such
as the frequency of attacks.

Within an approximate analytical treatment, one can
give a closed recursion relation for the mean entries of
the h-matrix. Consider the general case of $|S|$ different
percepts and $|A|$ different actions, where for each percept
there is a single rewarded action. For simplicity, let us
assume a regular training scenario, $P^{(n)}(s) = \delta(s - n \bmod |S|)$,
such that, within a subsequence of $|S|$ cycles, each
percept is excited exactly once and in the same order.
For such a scenario, one can derive from (6) a recursion
relation of the form

$$\bar h^{(n+|S|)}(s,a) - 1 \simeq (1-\gamma)^{|S|}\left(\bar h^{(n)}(s,a) - 1\right) + (1-\gamma)^{|S|-1-s}\,\lambda\,\frac{\bar h^{(n)}(s,a)}{\sum_{a' \in A} \bar h^{(n)}(s,a')} \tag{11}$$

for rewarded transitions, and a similar expression,
without the gain term (i.e. $\lambda = 0$), for the unrewarded
transitions. Here, $\bar h^{(n)}(s,a)$ denotes the averaged weight for
a rewarded transition ⓢ → ⓐ, taken over an ensemble
of runs. Equation (11) is not exact and in general
contains an overestimation of the gain term, but for small

Figure 7: Performance of agents with different values of the
reflection time: R = 1 (lower curve) and R = 2 (upper curve).
One can see that a larger value of the reflection time leads
to an increased learning speed. The dissipation rate (which
is a measure of forgetfulness of the agent) is in both cases
$\gamma = 1/50$. Ensemble average over 1000 runs with error bars
indicating one standard deviation.



values of $\gamma$ it gives a rather good approximation to the
numerical results [37]. The steady-state condition reads
$\bar h^{(n+|S|)}(s,a) = \bar h^{(n)}(s,a) \equiv \bar h(s,a)$, whereby $\bar h(s,a') = 1$
for all unrewarded transitions. This leads to quadratic
equations of the form

$$\bar h(s,a) - 1 = (1-\gamma)^{|S|}\left(\bar h(s,a) - 1\right) + (1-\gamma)^{|S|-1-s}\,\lambda\,\frac{\bar h(s,a)}{\bar h(s,a) + |A| - 1}, \tag{12}$$

that can be solved analytically, providing an approximate 
value for the steady-state blocking efficiency f ~ 

shown in Figure 8. (For $\lambda = 0$, one obtains from (12)
the trivial steady-state value $\bar h(s,a) = 1$, recovering the
value $\bar r = 0.5$ for random action). Similarly, based on
(11), one can derive an approximate analytic expression 
for the initial slope of the learning curve 



Af 
An 



A(l- 7 ) 



\s\-i/ 



I) 



\A\*(\S\- l + |5|/2) 



(13) 



In Figure 8, we plot the learning curves (evolution of the
average blocking efficiency) together with the analytic
approximations, for different values of $|S|$, $|A|$, and $\lambda$.

We next investigate the performance of the agent in
more complex environments in order to illustrate the
scalability of our model. In the invasion game, a natural
scaling parameter is given by the size $|S|$ of the percept
space (number of doors through which the attacker can
invade) and/or the size $|A|$ of the actuator space. As a
figure of merit, we have looked at the learning time $\tau = \tau_{0.9}$,
which we define as the time the agent needs to achieve
a certain blocking efficiency (for which we choose 90% of
the maximum achievable value). We find that learning

Figure 8: Initial growth and asymptotic value of average
blocking efficiency for different sizes of percept ($|S|$) and
actuator ($|A|$) space, and reward parameter $\lambda$. The learning
curves are obtained from a numerical average over an
ensemble of 10000 runs with random percept stimulation ($\gamma = 0.01$).
Error bars (not shown) are of the order of the fluctuations in
the learning curves. The analytic lines are obtained from
(11), see main text.



time increases linearly in both $|S|$ and $|A|$ (i.e.
quadratically in N, if we set $N = |A| = |S|$). The same scaling
can be observed if we apply standard learning algorithms
like Q-learning or AHC [1] to the invasion game [37]. In
Figure 9, the scaling of the learning time is shown for
different values of R. Besides the linear scaling with $|S|$,
it can be seen how reflections in clip space, as part of the
simulation, speed up the learning process.

Figure 9: Learning time $\tau_{0.9}$ as a function of $|S|$ for different
values of the reflection parameter R. We observe a linear
dependence of $\tau_{0.9}$ on $|S|$ with a slope determined by R.
Ensemble average over 10000 runs, $\gamma = 0$.

A more detailed discussion of the analytic results,
together with a comparison of PS with other models of
reinforcement learning, will be presented in [37] (see also
Section VI).



B. Projective simulation & learning with 
composition I 

The possibility of multiple reflections, as discussed in 
the previous subsection (Figure 7), illustrates an advan- 
tage of having a simulation platform where previous ex- 
perience can be reinvoked and evaluated before real ac- 
tion is taken. 

The episodic memory described in Figure 4 was of 
course a quite elementary and special instance of the 
general scheme of Figure 2. We have assumed that the 
activation of a percept clip is immediately followed by 
the activation of an actuator clip, simulating a simple 
percept-action sequence. This can obviously be general- 
ized along various directions. In the following, we shall 
discuss one generalization, where the excitation of a per- 
cept clip may be followed by a sequence of jumps to other, 
intermediate clips, before it ends up in an actuator clip. 
These intermediate clips may correspond to similar, pre- 
viously encountered percepts, realizing some sort of as- 
sociative memory, but they may also describe clips that 
are spontaneously created and entirely fictitious (see Sec- 
tion VB). 

Such a scenario, which generalizes the situation of Fig- 
ure 4, can be summarized by the following rules. 

1. Every percept s triggers a sequence of memory clips
$\Gamma = (ⓢ, ⓢ_1, \ldots, ⓢ_D, ⓐ)$, starting with ⓢ and
ending with some actuator clip ⓐ [59]. The number
D denotes the deliberation length of the sequence.
The case D = 0 corresponds, per definition, to the
direct sequence $\Gamma = (ⓢ, ⓐ)$.

This is illustrated schematically in Figure 10, which
shows an example of an episodic memory
architecture with sequences of deliberation length D = 0
and D = 1. Here, after excitation of the
percept clip, the agent may either excite an
actuator clip directly, or first excite some other
intermediate clip which, in its turn, activates an
actuator clip. We shall sometimes refer to the former
sequence as "direct", and to the latter as
"compositional".

2. If $(s, a)$ corresponds to a rewarded percept-action
pair (i.e. it was rewarded in a recent cycle and the
corresponding emotion tag is set to ☺) [60], then
the simulation is left and the actuator clip ⓐ is
translated into real action a. Otherwise, a new
(random) sequence $\Gamma' = (ⓢ, ⓢ'_1, \ldots, ⓐ')$ is
generated, starting with the same percept clip ⓢ but
ending possibly with a different actuator clip ⓐ′.
The (maximum) number of fictitious clip sequences
that may occur before real action is taken is given
by the reflection time R.

3. The probability for a transition from clip ⓒ to clip
ⓒ′ is determined by the weights $h^{(n)}(c,c')$ of the
edges of a directed graph [38] connecting the
corresponding clips:

$$p^{(n)}(c'|c) = \frac{h^{(n)}(c,c')}{\sum_{c''} h^{(n)}(c,c'')}, \tag{14}$$

where the sum in the denominator runs over all
clips that are connected with ⓒ by an outgoing
edge (i.e. an edge directed from ⓒ to ⓒ″).

4. After the simulation in cycle n is concluded, some
action will be taken which we denote by $a^{(n)}$. If
the action is rewarded (i.e. $\Lambda(s^{(n)}, a^{(n)}) = 1$),
then the weights of all transitions that occurred in
the preceding simulation will be enhanced:
(i) The weights of transitions ⓒ → ⓒ′ that appear
in the simulated sequence $\Gamma = (ⓢ, \ldots, ⓒ, ⓒ', \ldots, ⓐ)$
with $s = s^{(n)}$ and $a = a^{(n)}$ increase by the
amount

$$\Delta_+ h^{(n)}(s_{i-1}, s_i) = K \ \text{ for } i = 1, \ldots, D \ (\text{with } s_0 \equiv s), \qquad \Delta_+ h^{(n)}(s_D, a) = 1. \tag{15}$$

(ii) In addition, the weight of the direct transition
ⓢ → ⓐ will also be increased by unity

$$\Delta_+ h^{(n)}(s,a) = 1. \tag{16}$$

The parameter K thereby quantifies the growth
rate of "associative" (or compositional) connections
relative to the direct connections.
(iii) Furthermore, the weights of all transitions in
the clip network, including those which were not
involved in the preceding simulation, will be
decreased according to the rule

$$\Delta_- h^{(n)}(c,c') = -\gamma\left(h^{(n)}(c,c') - h_0(c,c')\right), \tag{17}$$

which describes damping towards a stationary
value

$$h_0(c,c') = \begin{cases} 1, & \text{if } c \in S \text{ and } c' \in A \\ K, & \text{if } c \in S \text{ and } c' \in S \end{cases} \tag{18}$$

which distinguishes again direct connections from
compositional connections, as illustrated in
Figure 10. If the chosen action at the end of cycle
n is not rewarded, then no weights are enhanced
and only rule (iii) applies.

5. Concerning the initialization of the weights, vari- 
ous possibilities exist. Weights that are initialized 
to unity describe a sort of "innate" or a priori con- 
nections between a set of basic percepts and actua- 
tors. Other weights may initially be set to zero, for 
example on connections to more complex percepts, 




Figure 10: Projective simulation with composition, with
deliberation length D = 0, 1. Dark gray ovals indicate percept
clips and light gray ovals indicate actuator clips. Initially the
percept clip is excited. This may directly excite some actuator
clip ("Direct transitions"), or some other memory clip or
fictitious clip ("Composition"). In the latter case, the memory
(or fictitious) clip in its turn excites an actuator clip.



for which there are no innate action patterns avail- 
able. A simple rule that allows the connectivity 
of the memory (graph of the clip network) to grow 
through new perceptual input, is the following: 

If a percept clip is activated for the first time, all 
incoming connections to that clip are "activated" 
together with it, meaning that their weights are 
initialized to a finite value (which we also set to K 
in the following) [61]. This enables the accessibility 
of that clip from other clips. 
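
Rules 1, 3 and 4 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the network is stored as a nested dictionary of edge weights, the clip names and the toy weights are hypothetical, and the sketch assumes that for a direct sequence (D = 0) the transition ⓢ → ⓐ is enhanced only once, by unity. The reflection step of rule 2 and the activation of new clips in rule 5 are omitted for brevity.

```python
import random

def simulate_sequence(h, percept, actuators, rng):
    """Rule 1 with Eq. (14): random walk from a percept clip to an actuator clip."""
    seq = [percept]
    while seq[-1] not in actuators:
        targets = list(h[seq[-1]])                    # clips reachable by an outgoing edge
        weights = [h[seq[-1]][c] for c in targets]
        seq.append(rng.choices(targets, weights=weights)[0])
    return seq

def update(h, h0, seq, rewarded, K, gamma):
    """Rule 4: damping (iii) on all edges, then gains (i) and (ii) if rewarded."""
    for c in h:
        for c2 in h[c]:
            h[c][c2] -= gamma * (h[c][c2] - h0[c][c2])  # Eq. (17)
    if rewarded:
        s, a = seq[0], seq[-1]
        for c, c2 in zip(seq[:-2], seq[1:-1]):
            h[c][c2] += K                               # compositional links, Eq. (15)
        if len(seq) > 2:
            h[seq[-2]][a] += 1.0                        # last memory clip -> actuator
        h[s][a] += 1.0                                  # direct link, Eq. (16)

# Hypothetical toy network: an untrained "blue" percept clip linked to a
# trained "red" percept clip; "-" and "+" are the two actuator clips.
K = 0.5
actuators = {"-", "+"}
h = {"blue": {"red": K, "-": 1.0, "+": 1.0},
     "red": {"-": 1.0, "+": 5.0}}
h0 = {"blue": {"red": K, "-": 1.0, "+": 1.0},          # stationary values, Eq. (18)
      "red": {"-": 1.0, "+": 1.0}}

seq = simulate_sequence(h, "blue", actuators, random.Random(0))
update(h, h0, ["blue", "red", "+"], rewarded=True, K=K, gamma=0.0)
```

In the last line a rewarded compositional sequence (blue, red, +) is assumed for illustration: the link blue → red grows by K, while red → + and the direct link blue → + each grow by one.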

To illustrate the workings of compositional memory,
let us revisit the situation of Figure 6, where the percept
space $S = S_1 \times S_2$ comprises both the categories of shape,
$s_1 \in S_1$, and color, $s_2 \in S_2$ (the color of the shape), while
the actuator space A and the emotion space E contain
the same elements as before. This is a variant of the
invasion game, where the attacker can announce its next
move using symbols of different shapes and colors. The
network of clips behind the learning curves presented in
Figure 6 was simply a duplicated version of the graph
in Figure 4, with identical subgraphs for the two sets of
percepts of the same color.

In contrast, in Figure 11, we see the learning curves for 
the same game but with a slightly modified memory ar- 
chitecture. After having trained the agent with symbols 
of one color (red), at time step n = 200 the attacker starts 
using a different color (blue). In comparison with Figure 
6, now the agent learns faster, and the speed of learning 
increases with the strength of the parameter K. This sit- 
uation resembles a form of "associative learning" , where 
the agent "recognizes" a similarity between the percepts 
of different colors (but identical shapes). 

The structure of the memory that gives rise to these 
learning curves is sketched in Figure 12, which corre- 
sponds to a duplicated network described before, albeit 



[Plot: learning curves for K = 0, K = 1/10, K = 1 and K = 2; horizontal axis: time.]

Figure 11: Associative learning through projective simulation.
After first training the agent with symbols of one color (red),
at time step n = 200 the attacker starts to use a different
color (blue). In comparison with Figure 6, now the agent
learns faster. This situation resembles a form of "associative
learning", when the agent "recognizes" a similarity between
the percepts of different colors, but identical shapes. The
effect can be much enhanced if one allows for reflection times
R > 1. The memory that gives rise to these learning curves is
depicted in Figure 12. Ensemble average over 10000 agents.



with additional links between percepts of equal shape but 
different color. In Figure 12, we see the effect of learning 
on the state of the network at different times. Initially, 
before any stimulus/percept has affected the agent, the 
network looks as in Figure 12(a), with innate connections 
of unit weight between all possible percepts and actua- 
tors, respectively. Figure 12(b) shows the state of the 
network after the agent has been trained (indicated by 
the dotted arrows) with symbols of one color (red). We 
see that the weights for rewarded transitions have grown 
substantially such that the presentation of a red symbol 
will lead to the rewarded actuator move with high prob- 
ability. Moreover, the activation of the red-percept clips 
has initialized the incoming connections from similar per- 
cept clips with a different (blue) color. In this example, 
the weights are initialized with the value K. This initial- 
ization has, at this stage, no effect on the learning perfor- 
mance for symbols with a red color. However, when the 
agent is presented with symbols of a different color, the 
established links will direct the simulation process (prob- 
abilistically) to a "trained" region with well-developed 
links. This realizes a sort of associative memory (Fig- 
ure 12(c)). In the philosophy of projective simulation, 
association is a special instance of a compositional pro- 
cess, namely a random walk in clip space where similar 
clips can call each other with certain probabilities [62]. 

Note that, in the case of associative learning, only the
incoming links (i.e. transitions) to that percept are
activated together with it, thereby making its subsequent
links potentially available to similar new percepts. A
network where outgoing links are also activated typically
performs worse, in particular when the size of the percept
space (number of colors) grows. In that case, even when
a single percept is trained, the agent has to explore all
similar percepts together with it, which may lead to a
significant slowing down of the learning speed.

Figure 12: Effect of associative learning on the state of the
episodic memory at different times. The thickness of the lines
indicates the transition probabilities between different clips.
(a) Initial network, before any percept has affected the agent.
(b) State of the network after the agent has been trained
(dotted arrows) with symbols of one color (red). (c) When the
agent is presented with symbols of a different color (blue), the
established links will direct the simulation process
(probabilistically) to the previously "trained" region with well-developed
links. This realizes a sort of associative memory.

In Figure 13, we discuss further aspects of associative 
learning that follow from the rules of the projective sim- 
ulation. We saw in Figure 11 that the learning speed 
increases with the parameter K, which describes the rel- 
ative rate at which the weights of the compositional con- 
nections grow relative to the direct connections. How- 
ever, too large values of K can also have a counterpro- 
ductive effect, as the agent spends an increasing fraction 
of time with the simulation before it takes real action. In 
fact, it can almost get "lost" in a loop-like scenario where 
it jumps back and forth between virtual percept clips for 
a long time. In Figure 13, we plot the average delibera- 
tion time, i.e. the average time for which the simulation 

Figure 13: Average deliberation time, i.e. the average time
the simulation stays in compositional memory. A
deliberation time that is too long will, in this example, have
a negative effect on the learning fidelity, as the simulation will also
have increased access to other, wrong channels. Dissipation rate
$\gamma = 1/50$; ensemble average over 10000 agents.



stays in compositional memory. The scenario is the same 
as in Figure 11. After the change of color of the symbols, 
the agent will learn by building up new transitions in the 
network, but this learning will be assisted by using the 
pre-established transitions of the previous training period 
(Figure 12(c)), which will increase the deliberation time. 
For K < 1 the deliberation time is maximal right after 
the change of colors, and decreases again as the agent 
is developing direct connections from the percept clips 
to the rewarded actuator clips. For K = 2, however, 
the deliberation time continues to grow with the num- 
ber of cycles, until it settles at some value around 1.4 
(not shown). For larger values of K the asymptotic average
deliberation time can be significantly larger. In the
network of Figure 12(c) the latter situation means that 
the simulation can get lost in a loop by jumping back 
and forth between similar (red and blue) clips. While in 
the simple example of Figure 12(c) this may be avoided 
by certain ad hoc modifications of the update rule, it is 
a generic feature that will persist in more complex net- 
works. 

A deliberation (i.e. simulation) time that is too long 
will, in this example, eventually have a negative effect 
on the achievable blocking efficiency, as can be seen from 
the long-time limit of the learning curves in Figure 11. 
A slight decrease of the asymptotic blocking efficiency 
for larger values of K occurs because, by association, 
the simulation will also gain access to other unrewarded 
transitions inside the network [63]. The potentially neg- 
ative effect of high values of K gets more pronounced if 
the agent, by external constraints, only has a finite time 
available to produce an action. In our example of the
invasion game, this could be the time it takes for the
attacker to move from one door to the next. This introduces
a maximum deliberation time $D_{\max}$ to our scheme.
If the simulation takes longer than $D_{\max}$, the agent arrives
too late at the door even if it chose the right one,
and will consequently not be rewarded. In such a case,
the asymptotic performance of the learning for large values
of K drops significantly, as can be seen in Figure 14
for $D_{\max} = 2$. For short times, when the strengths of
the transitions have not yet grown too large, the simulation
still benefits from the association effect where, after
jumping from a percept clip (red) to a percept clip
(blue), there will be a strong transition to an actuator.
For longer times, however, the weights on the compositional
links have grown so strongly that they will also
dominate over the direct links from percept clips to actuator
clips. In summary, while compositional memory
can help, too large values of K can be counterproductive,
as the agent will most of the time be "busy with itself".

Figure 14: (a) Learning curve for different values of the associativity
parameter K if the agent, by external constraints,
has only a finite time available to produce an action. If the
simulation takes longer than $D_{\max}$, the agent will not be rewarded.
In such a case, the asymptotic performance of the
learning drops dramatically for large values of K. An ensemble
average over 10000 games is shown.
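The effect of a finite deliberation time can be illustrated with a minimal sketch. It assumes that the deliberation time is counted as the number of hops in the clip network and that hopping probabilities are proportional to the transition weights; the toy network, its weights, and the names `hop`, `deliberate` and `timeout_fraction` are hypothetical and are not taken from the simulations reported here.

```python
import random

def hop(weights, clip, rng):
    """One transition: pick a neighbouring clip with probability h / sum(h)."""
    targets = list(weights[clip])
    h_values = [weights[clip][t] for t in targets]
    return rng.choices(targets, weights=h_values, k=1)[0]

def deliberate(weights, percept, actuators, d_max, rng):
    """Random walk from `percept` until an actuator clip is reached.

    Returns (action, hops). If no actuator is reached within d_max hops,
    the walk is cut off and (None, d_max) is returned; in this sketch
    that outcome stands for "too late, no reward" (an assumption).
    """
    clip, hops = percept, 0
    while clip not in actuators:
        if hops == d_max:
            return None, hops
        clip = hop(weights, clip, rng)
        hops += 1
    return clip, hops

def timeout_fraction(weights, percept, actuators, d_max, trials=1000, seed=0):
    """Fraction of simulations that exceed the maximum deliberation time."""
    rng = random.Random(seed)
    late = sum(deliberate(weights, percept, actuators, d_max, rng)[0] is None
               for _ in range(trials))
    return late / trials

# Hypothetical toy network: two similar percept clips ("red", "blue") joined
# by strong compositional links, and two actuator clips.
ACTUATORS = {"left", "right"}
WEIGHTS = {
    "blue": {"red": 8.0, "left": 1.0, "right": 1.0},
    "red": {"blue": 8.0, "left": 6.0, "right": 1.0},
}
```

With these illustrative weights, a large share of the walks starting from the blue clip bounce between the two percept clips and miss a deadline of two hops, whereas a generous deadline is almost never missed. This mirrors the qualitative statement that strong compositional links keep the agent "busy with itself".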

Before we proceed in the following section to discuss
yet another way to use the compositional
memory for learning, it should be noted that many of
the observed features can be changed by varying the parameters
γ, R, K in the update rules, or by modifying
the ways of initializing the memory. For example, as
we have seen earlier (in Figure 5), dissipation introduces
a mechanism of forgetting, which limits the achievable
success probability but at the same time gives the agent
more flexibility in adapting to a new strategy of the attacker.
To have an agent with both a high flexibility and
a high blocking efficiency, one can choose a finite value
of the dissipation rate γ together with an increased reflection
time R, as is demonstrated in Figure 15. A similar enhancement
can be observed for the associativity effect in
Figure 11 by increasing R [64].

Figure 15: To obtain an agent with both high flexibility to
adapt to new attack strategies and a high blocking efficiency,
one can combine a finite dissipation rate γ (flexibility)
with an increased reflection time R = 2 (efficiency). The
plots should be compared with Figure 5. Ensemble average
over 10000 games.

Another possibility to increase the achievable efficiency 
is to let the connections of the network dissipate com- 
pletely when they are not used. While the innate network 
is characterized by a high connectivity, a trained network 
will develop both enhanced and suppressed connections. 



C. Projective simulation & learning with
composition II

In the previous section we saw that projective simu- 
lation allowed for associative learning: A novel percept 
(clip), which had no a priori preference for any actuator 
movement, could excite another clip in episodic memory, 
from which strong links to specific actuators had been 
built-up by previous experience. The agent, while pre- 
sented with a blue arrow, would, with a certain probabil- 
ity, associate it with a red arrow whose meaning it was 
already familiar with. 

A different and more complex behavior can be gener- 
ated if the agent's actions are not only guided by recall- 
ing episodes from the past, but if it can create, as part 
of the simulation process itself, fictitious episodes that 
were never perceived before. In the course of the simu- 
lation it may for example introduce variations of stored 
episodes, or it may merge different episodes to a new one, 
thereby varying or redefining the (virtual) past. The test 
for all such projections is whether or not the resulting 
(factual) actions will eventually be rewarded. In other 
words, it is the performance of the agent in its real life, 
that selects those virtual episodes that have led to successful
actions, enhancing the corresponding connections
in memory. These principles give the agent a notion of
freedom [4] to "play around" with its episodic memories,
while at the same time optimizing its performance in the
environment.

Figure 16: Creation of a new and fictitious clip in the memory
of the two-dimensional agent. This figure illustrates the
schematic evolution of the (relevant part of the) clip network
behind Figure 17. Frequent excitation of two different actuator
clips from a single percept clip leads to the creation of
a novel, merged clip which becomes part of the existing clip
network. (See main text.)

While it is intuitively clear that such additional capa- 
bility will be beneficial for the agent, its world (i.e. task 
environment) must be sufficiently complex to make use 
of this capability. A typical feature of a complex environ- 
ment is that the agent can, at some point, "discover" new 
behavioral options that were previously not considered, 
i.e., not in the standard repertoire of its actions [65]. 

To map the essential aspects of such a complex situation
into our example, we imagine a modification of
our invasion game where the defender-agent can move in
two dimensions, i.e. up and down in addition to left
and right. In the notation of Section IV, this corresponds
to an enlarged actuator space $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2$
with $a = (a_1, a_2) \in \{+, 0, -\} \times \{+, 0, -\}$ such that, with
this notation, right = (+, 0), left = (−, 0), up = (0, +),
down = (0, −). In a robot design, the actuators $a_1$ and
$a_2$ would refer to different motors for motion in the x and y
directions. One can imagine a two-dimensional array of
doors in the x-y plane, through which the attacker tries
to pass, now entering from the third dimension (z-axis).
The attacker will move along any of these four directions
as well, and use appropriate symbols to announce
its moves. However, in addition to those moves, it will
at some point start moving also along the diagonals, e.g.
to the upper-left, in a single step. The defender will first
continue to move in the trained directions, simply because
the more complex motion along the diagonal is not
in its immediate repertoire (although it may technically
be able to do it, e.g. by activating the two motors for horizontal
and vertical motion at the same time). We assume
that there are partial rewards if the defender moves into
the correct quadrant, e.g. by "blocking" at least one of the
coordinates of the attacker. To be specific, we consider
the situation where, from a certain point on, the attacker
always moves to the upper-right corner (i.e. along the
+45° diagonal). If the agent moves right or up, it will
be rewarded; if it moves left or down, it will not. Under
the rules specified so far, the agent will, after a transient
phase of random motions, be trained so that it will move
either up or right, with a probability of about 50% each.
How can the agent conceive of the "idea" that it could
also move along the diagonal direction, by letting both
motors run simultaneously, if this composite action was
not in its immediate (or: active) repertoire? [66] The
scenario of projective simulation allows for the possibility
that, through random clip composition, a merged or
mutated clip can be created that triggers both motors of
a composite actuator move. In a sense, the agent would
simulate this movement, by chance, before it tries it out
in real life. The latter may occur specifically in situations
with multiple rewards (or ambivalent moves).

Figure 17: Learning curve of a 2D agent (see text) which, after
having been trained on the horizontal and vertical directions
(using symbols ⇐, ⇒ and ⇑, ⇓, respectively), is suddenly
confronted, at time n = 0, with moves of the attacker along the
diagonal, announced by the symbol ⇗. We assume a reinforcement
scheme where a movement into the correct quadrant
(either right or up) is rewarded by a unit increase of the
corresponding clip transitions, while a composite movement
along the diagonal direction (+45°) is rewarded more strongly, e.g.
by λ = 4. The agent will first quickly learn to move into the
correct quadrant - under the rules described in the previous
sections - while on a longer time scale it will discover the
corresponding composite move with the higher reward.

One can think of several possibilities of defining clip 
merging and variation. A natural possibility exists if, in 
generalizing our scheme, we allow for parallel excitations 
of several clips at the same time. Depending on some 
compatibility constraints, more than one of these clips 
could then couple out and lead to simultaneous actuator 
moves. 

In the present scenario, however, the simulator can 
only activate one clip at a time, but it will happen that 
two of the clips (e.g. those associated to right and up) 
are activated frequently and with similar probabilities. 



Here one can, e.g., define a threshold scheme where a merging
of both clips is likely to happen only under the condition
that the connections to both of them are sufficiently
strong [67]. The merging itself can be defined on the set
of basic elements which make up the clips, obeying certain
syntactic constraints. For example, in the case of the
two-dimensional invasion game, we may merge the actuator
clips corresponding to right = (+, 0) and up = (0, +)
into a new clip corresponding to right-up = (+, +), but
it is syntactically forbidden to merge right = (+, 0) and
left = (−, 0).

To demonstrate the basic idea, we have implemented
a rule according to which the frequent excitation of different
actuator clips (of syntactically compatible moves)
from a single percept clip creates at some point a novel,
merged actuator clip which becomes part of the clip network.
Figure 16 illustrates the schematic evolution of the
(relevant part of the) clip network. The grey arrows indicate
previously grown transitions, after the agent has
been trained in the horizontal (⇒) and vertical (⇑) directions.
After such an initial training period, the agent is
confronted (dotted arrow) with diagonal moves (see left
part of Figure 16), announced by the symbol ⇗. When
the weights on the two different transitions leaving the
percept clip grow beyond a given threshold, a new merged clip is
created and connected to that percept clip, with a weight that is equal
to the sum of the weights on the constitutive transitions.
This merging process is indicated schematically in the
right part of Figure 16.
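The merging rule can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: moves are encoded as tuples with entries in {+1, 0, −1}, "syntactically compatible" is taken to mean that no motor is driven by both moves, and the threshold value is arbitrary. The function names `compatible`, `merge` and `maybe_merge` are hypothetical.

```python
from itertools import combinations

RIGHT, LEFT, UP, DOWN = (1, 0), (-1, 0), (0, 1), (0, -1)

def compatible(a, b):
    """Syntactic rule (assumed): two distinct moves may merge only if no
    motor is driven by both, e.g. right+up is allowed, right+left is not."""
    return a != b and all(x == 0 or y == 0 for x, y in zip(a, b))

def merge(a, b):
    """Component-wise combination of two compatible moves."""
    return tuple(x + y for x, y in zip(a, b))

def maybe_merge(h, percept, threshold):
    """Create merged actuator clips for `percept` where warranted.

    h[percept][action] is the weight of the transition percept -> action.
    A merged clip is created when both constituent weights exceed the
    threshold; its weight is the sum of the two constituent weights.
    Returns the list of newly created actuator clips.
    """
    created = []
    for a, b in combinations(list(h[percept]), 2):
        strong = h[percept][a] > threshold and h[percept][b] > threshold
        if strong and compatible(a, b):
            new_clip = merge(a, b)
            if new_clip not in h[percept]:
                h[percept][new_clip] = h[percept][a] + h[percept][b]
                created.append(new_clip)
    return created
```

For a percept clip whose transitions to right and up both exceed the threshold, `maybe_merge` adds the clip (1, 1), i.e. right-up, with the summed weight; a second call leaves the network unchanged.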

In Figure 17, we show the resulting learning curve of
the agent, which was previously trained (n < 0, not
shown) on the horizontal and vertical directions (using
symbols ⇐, ⇒ and ⇑, ⇓, respectively) and is then (at
time n = 0) confronted with moves of the attacker along
the diagonal (announced by the symbol ⇗) [68]. We
assume a reinforcement scheme where a movement into
the correct quadrant (either right or up) is rewarded by
a unit increase of the corresponding weights in the clip
network, while a composite movement right-up (both
right and up) is rewarded more strongly, with λ = 4. One
can see that the agent will first quickly learn to move
into the correct quadrant - under the rules described in
the previous sections - while on a longer time scale it
will discover the corresponding composite move with the
higher reward.



VI. CONNECTION WITH EXISTING 
LITERATURE 

The problem of learning has been investigated in var- 
ious fields ranging from psychology, cognitive neuro- 
science, and philosophy, to artificial intelligence, machine 
learning, and robotics. In the following, we shall compare 
our model with some of the works in these fields. 

Historically, the idea of using internal representations 
and simulations for learning and prediction was already 
recognized as a key ingredient for cognitive development 
in the works by Tolman [9] (idea of cognitive maps) and
Piaget [10] (role of the internal manipulation of represen- 
tations). The notion of episodic memory was introduced 
in psychology in the 1970s by Tulving [7] and Ingvar [8], 
and it has been attracting increasing attention in various 
fields. The specific role of episodic memory for simulat- 
ing future events has recently been discussed by Schacter 
et al. [15] in the neurosciences, and by Hasselmo [16] who 
discusses brain mechanisms for episodic memory. 

Concepts and ideas for learning also play a major role
in artificial intelligence, machine learning and robotics. 
The problem of prediction is indeed one of the main top- 
ics in machine learning, starting with the seminal work 
of Holland [30] who introduced the notion of classifier 
systems, and many subsequent works have used ideas of 
internal simulation for planning and prediction (for ex- 
ample [25, 26, 27, 28, 29] and references in reinforcement 
learning as discussed below). While classifiers [30] bear 
a certain similarity with the notion of clips that we have 
introduced in this paper, there are important differences. 
First, learning classifier systems assume a population or 
ensemble of classifiers (i.e. condition-action rules) and in- 
volve a deterministic computation (of the average predic- 
tion of a sub-ensemble of classifiers advocating a certain 
action), after which a specific action is chosen. The ran- 
dom walk through the clip network, in contrast, is much 
more primitive; it involves no ensemble and no computa- 
tion. Instead, it amounts to the random hopping through 
a set of possible clips (including the possibility of creating 
new clips along the way), without the ability of choosing, 
sampling, averaging, or in any way optimizing over that 
set. Every projective simulation corresponds to a single 
trajectory of a stochastic process (this is important for 
subsequent quantum generalization, see Section VII). 

In the field of reinforcement learning [1], a number of 
ideas have been discussed which are in some sense related 
to our work [17, 18, 19, 20, 21, 22, 23, 24]. This concerns 
in particular the notion of experience replay by Lin [17] 
and recent ideas by Sutton et al. [21]. The work by Lin 
[17] studies several extensions to standard reinforcement 
learning algorithms, the most relevant of which, for our 
present work, is the method of experience replay. In Lin's 
model, "by experience replay, the learning agent simply 
remembers its past experiences and repeatedly presents 
the experiences to its learning algorithm as if the agent 
experienced again and again what it had experienced before"
([17], p. 299). This idea of experience replay has
a certain similarity with our notion of multiple reflections
in clip space (indicated by the parameter R in
Equation (9) and in Figure 7); yet, a closer inspection
reveals both conceptual and technical differences. The
main effect of experience replay in the sense of Lin is to
boost the learning process which, in our model, would
amount to an (off-line) change of the weights in the clip
network. Experience replay is like a module for (self-)teaching:
After experiencing a real situation once, the
agent gets the chance to review this experience again
and again, before taking the next action. Our notion of



episodic memory differs from this one inasmuch as it uses 
an explicit internal representation and allows more subtle
ways of re-using previous experience. For example,
the occurrence of multiple reflections, which also boost 
the learning speed, is conditioned on the state of certain 
emotion flags that represent short-time memory. These 
flags prevent the agent from taking an action that was re- 
cently found non-rewarded and give the agent a "second 
chance" to find the right action, but these internal reflec- 
tions do not change the weights of the clip network. As 
a second example, the possibility of clip composition (as 
discussed in Section V C) introduces structural changes
that also go beyond mere changes of the weights in the 
clip network. Generally speaking, projective simulation 
is more integrated with the real actions of the agent; it 
is a continuous process that runs in parallel ("on-line") 
with the real actions. 

The work by Sutton et al. [18, 21] on Dyna-style planning
seems in that respect closer to our work. Quoting
from Ref. [21]: "Dyna-style planning proceeds by
generating imaginary experience from the world model
and then applying model-free reinforcement learning algorithms".
This is reminiscent of the use of projective
simulation to generate fictitious sequences of memory
to guide subsequent action. The underlying conceptual
framework is, nevertheless, quite different. Like most re- 
inforcement learning algorithms, the framework of Dyna- 
style planning is much more computational than our ap- 
proach. It uses world models for planning and to decide 
the course of action. Such planning involves a non-trivial 
computational process (Dyna-algorithm for policy eval- 
uation) the result of which is then used by the agent 
to find the optimum course of action. Projective simula- 
tion, as mentioned before, is much more primitive; it only 
involves random hopping through a set of clips, without 
any further computation. The only parameters that need 
to be changed and updated in the clip network are the 
weights of the clip transitions, similar to neural networks
(however with the difference that new clips may be cre- 
ated). In that sense, projective simulation is much more 
embodied and should rather be compared with a biolog- 
ical stochastic process than with the result of planning 
and computation. 

Despite their conceptual differences, on simple tasks
like the invasion game, these different learning models
show similar features. In Figure 18, we compare the performance
of the learning models in the invasion game
with two symbols and two actions, $|\mathcal{S}| = |\mathcal{A}| = 2$,
where the attacker changes the meaning of the symbols
at n = 150. We compare learning curves of (a) projective
simulation, using multiple reflections (reflection number
R), with (b) experience replay (replay number N), and
(c) Dyna-style planning (planning number p), where the
latter two models were based on the Q-learning algorithm
[1]. Increasing the parameters R, N, and p leads to an
increased learning speed in each of the respective models,
with similar performance. However, in contrast to experience
replay and Dyna-style planning, projective simulation
with multiple reflections (as defined in Section III)
increases not only the learning speed but also the maximum
achievable value of the blocking efficiency. The
latter can also be achieved in (b) and (c) by changing
the external reward.

Figure 18: Comparison of projective simulation with experience
replay [17] and Dyna-style planning [21]. Learning
curves are shown for (a) projective simulation (reflection number
R) with γ = 1/10 and λ = 1, (b) experience replay (replay
number N), (c) Dyna-style planning (planning number
p), whereby both (b) and (c) use the tabular Q-learning algorithm
[1] with a softmax action selection rule, based on the
Boltzmann distribution. For both (b) and (c) the Q function
was initialized to 1, and a reward of 1.5856 was used
together with a learning-rate parameter of α = 0.4. The parameters
were chosen such that for R = 1, N = 1, p = 0, the
initial learning speed and the asymptotic value of the respective
learning curves are similar. In (c) the imagined state and
action were picked randomly out of all possible states and actions.
It is seen that increasing the parameters R, N, and p
leads to an increased learning speed in each of the respective
models, with similar performance. However, in contrast to
experience replay and Dyna-style planning, projective simulation
with multiple reflections increases not only the learning
speed but at the same time the maximum achievable value of
the learning parameter (blocking efficiency).
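For orientation, the Q-learning baseline (b) can be sketched as follows. This is not the code behind Figure 18. It takes the Q initialization, reward and learning rate from the caption, and it additionally assumes a zero discount factor (each round is treated as a one-step episode), an inverse temperature `beta = 1` for the softmax rule, and a simplified form of replay in which only the most recent experience is presented N times. The names `q_update`, `softmax_choice` and `train` are hypothetical.

```python
import math
import random

def q_update(q, reward, alpha, replays):
    """Present the same experience `replays` times to the learner
    (one-step update with zero discount, an assumption of this sketch)."""
    for _ in range(replays):
        q += alpha * (reward - q)
    return q

def softmax_choice(q_values, beta, rng):
    """Boltzmann action selection with inverse temperature beta."""
    weights = [math.exp(beta * q) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]

def train(n_steps, replays=1, switch_at=None, alpha=0.4, reward=1.5856,
          q_init=1.0, beta=1.0, seed=0):
    """Invasion game with two symbols and two actions.

    Returns the Q table q[percept][action] and, per round, whether the
    attacker was blocked. At round `switch_at` the meaning of the
    symbols is exchanged.
    """
    rng = random.Random(seed)
    q = [[q_init, q_init], [q_init, q_init]]
    flipped = False
    blocked = []
    for n in range(n_steps):
        if switch_at is not None and n == switch_at:
            flipped = True
        s = rng.randrange(2)                 # symbol shown by the attacker
        correct = 1 - s if flipped else s    # action that blocks the attacker
        a = softmax_choice(q[s], beta, rng)
        r = reward if a == correct else 0.0
        q[s][a] = q_update(q[s][a], r, alpha, replays)
        blocked.append(a == correct)
    return q, blocked
```

Calling `train(300, replays=N, switch_at=150)` for several values of N and averaging `blocked` over many seeds gives learning curves of the kind compared in Figure 18(b), within the assumptions listed above.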

Generally speaking, we find that on simple tasks like 
the invasion game the performance of projective simu- 
lation is certainly competitive with other modern rein- 
forcement learning algorithms such as experience replay 
[17] or Dyna-style planning [21]. For more complex task
environments these different models may perform differently
well on different aspects. With increasing dimension
of percept and action space, we find a linear scaling
of the learning time with $|\mathcal{S}|$ and $|\mathcal{A}|$, respectively, similar
to Q-learning [37]. For problems that require long-term
planning, we expect methods based on Q-learning
or the adaptive heuristic critic [1] to be more favorable,
whereas projective simulation with the possibility of clip
composition, as discussed in Section V C, should be favorable
in problems where "creative" action in a given
situation is in demand. A combination of ideas from
projective simulation, such as the use of internal flags 
encoding short-time memory, with established algorithms 
for long-time planning is part of an ongoing investigation 
[37]. 



VII. QUANTUM PROJECTIVE SIMULATION 

We now address the generalization of projective sim- 
ulation to quantum mechanical operation. The motiva- 
tion of this question is twofold. One reason is the ongo- 
ing miniaturization of devices down to the scale of nano- 
technologies. It is conceivable that soon robots will be 
used to control matter even on the molecular and atomic 
scale, be it in basic research laboratories or in medical 
applications inside the human body. Agent research will 
then have to deal with issues of quantum feedback and 
control [39] and its future applications. 

Another, more direct, reason has to do with the com- 
putational capabilities of quantum computers. It was 
found that computers which operate on quantum me- 
chanical principles can solve certain mathematical tasks 
much more efficiently than any classical computer [5]. 
It is thus natural to ask whether a similar benefit can 
be expected for models of artificial intelligence when the 
architecture of agents involves quantum mechanics. If 
one defines an intelligent agent or robot simply as some 
machine with a "computer on board" and with sensors 
& actuators as "input-output devices", then the answer 
seems to be straightforward: Replace the classical com- 
puter with a quantum computer, run the right quantum 
algorithm on it, and thus obtain a more efficient agent. 
The question is then, of course, what is the right quan- 
tum algorithm. A more fundamental problem with this 
approach is that such a computational viewpoint might
miss essential aspects of intelligent behavior from the be- 
ginning. It seems that neither a classical computer nor 
a quantum computer per se will make the agent intel- 
ligent, nor will any fixed algorithm that runs on these 
devices. As it has been emphasized in recent literature 
on artificial intelligence [3, 6], the emergence of intelli- 
gent behavior seems to require continuous feedback be- 
tween the agent and its environment at its very heart: In 
modern terminology, the agent needs to be embodied and 
situated in an environment it interacts with [3]. Mod- 
ern notions of (reinforcement) learning and agents are 
developed within this framework, and so is our approach 
to creative behavior, in which the network of clips, i.e.
the episodic memory, grows as the agent interacts with
the world. Furthermore, the evolution of the episodic 
memory (clip network) is thereby firmly embedded in the 
agent architecture. 

In the following we describe how the model of projective
simulation can be generalized to the quantum regime,
introducing a notion of quantum agents. In quantum 
mechanics, states of a system are described by vectors 
(or rays) in a complex Hilbert space, and observables 
by linear Hermitean operators acting on that space. A 
quantum-enhanced autonomous agent can be defined as 
an agent that interacts with a classical environment, but 
whose memory (or, more generally, internal state) uses 
quantum degrees of freedom [69]. In the notation and 
terminology of Section IV, the external variables s (per- 
cepts) and a (actions) are then still classical variables, 
while the clips $c \in \mathcal{C}$ become quantum states $|c\rangle \in \mathcal{H}_{\mathcal{C}}$
(Hilbert space of the memory). An external stimulus s
will excite memory in a quantum state $|c\rangle = |ⓢ\rangle$ (the
percept clip) which has now the status of a basis state
in the memory system. The random walk in clip space,
which is an essential ingredient in our model, now becomes
a quantum walk in the associated Hilbert space of
the (quantum) memory, with the replacements

$$p(c'|c) \;\rightarrow\; |\langle c'|c\rangle|^2 \tag{19}$$

for elementary transitions between clips, and

$$p(c''|c) \;\rightarrow\; \Big|\sum_{c'} \langle c''|c'\rangle \langle c'|c\rangle\Big|^2 \tag{20}$$

for composite transitions. Here the scalar product
$\langle c'|c\rangle$ defines the probability amplitude for the transition
$c \rightarrow c'$, and the modulus squared in the expression for
the composite transition gives rise to quantum interference,
which is one of the basic features of quantum mechanics.
Quantum interference is in particular exploited
in fast algorithms for quantum search [40] and quantum
walks on graphs [41].
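A small worked example may help to see how the composite rule (20) differs from its classical counterpart. The amplitudes below are purely illustrative and are not taken from any of the networks studied here; they describe two intermediate clips $c'_1$ and $c'_2$ through which the walk can pass from $c$ to $c''$, and the classical comparison uses the standard composition of transition probabilities.

```latex
% Hypothetical amplitudes for two intermediate clips c'_1 and c'_2
\langle c'_1|c\rangle = \langle c'_2|c\rangle = \tfrac{1}{\sqrt{2}}, \qquad
\langle c''|c'_1\rangle = \tfrac{1}{\sqrt{2}}, \quad
\langle c''|c'_2\rangle = -\tfrac{1}{\sqrt{2}}

% Classical composition: probabilities of the two paths add
p(c''|c) = \sum_{c'} p(c''|c')\, p(c'|c)
         = \tfrac{1}{2}\cdot\tfrac{1}{2} + \tfrac{1}{2}\cdot\tfrac{1}{2}
         = \tfrac{1}{2}

% Quantum composition, Eq. (20): amplitudes add before squaring
\Big|\sum_{c'} \langle c''|c'\rangle \langle c'|c\rangle\Big|^2
         = \big|\tfrac{1}{2} - \tfrac{1}{2}\big|^2 = 0
```

In this example the two paths interfere destructively, so a transition that occurs with probability 1/2 in the classical walk is completely suppressed in the quantum walk. Flipping the sign of one amplitude would instead give constructive interference and a probability of 1.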

Let us now describe the quantization procedure in 
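more detail. With the clip network as illustrated in Figure 2
one can associate a graph $G = (V, E)$, where the
vertices $j \in V$ label the different clips $c_j \in \mathcal{C}$ within the
network and the edges $\{j, k\} \in E$ denote possible transitions
between clips. A quantum walk in memory space
is then generated by a Hamiltonian of the form [42]

$$
\begin{aligned}
H &= \sum_{\{j,k\} \in E} \lambda_{jk} \left( \hat{c}_k^\dagger \hat{c}_j + \hat{c}_j^\dagger \hat{c}_k \right) + \sum_{j \in V} \epsilon_j\, \hat{c}_j^\dagger \hat{c}_j \\
  &= \sum_{\{j,k\} \in E} \lambda_{jk} \left( |c_k\rangle\langle c_j| + |c_j\rangle\langle c_k| \right) + \sum_{j \in V} \epsilon_j\, |c_j\rangle\langle c_j|
\end{aligned}
\tag{21}
$$

where the operator $\hat{c}_j^\dagger$ excites the memory from its ground
state into clip $c_j$,

$$|c_j\rangle = \hat{c}_j^\dagger |\mathrm{vac}\rangle \tag{22}$$

and $\hat{c}_k^\dagger \hat{c}_j$ induces a transition $c_j \rightarrow c_k$:

$$\hat{c}_k^\dagger \hat{c}_j |c_j\rangle = |c_k\rangle. \tag{23}$$

The dynamical equation that describes the coherent
quantum walk is given by the Liouville-von Neumann
equation

$$\frac{\partial}{\partial t}\rho = -i[H, \rho] \tag{24}$$

where $\rho = \rho(t)$ is the quantum state (density operator)
of the memory at time t, $[H, \rho] = H\rho - \rho H$ is the commutator,
and we have set Planck's constant to unity.

The (real) coupling parameters $\lambda_{jk}$ in (21) induce coherent
transitions between the different clips in the network.
One can also include further, incoherent, transitions
described by a Liouvillean operator of the type

$$
\mathcal{L}\rho = \sum_{j,k} \kappa_{jk} \left( \hat{c}_k^\dagger \hat{c}_j\, \rho\, \hat{c}_j^\dagger \hat{c}_k - \tfrac{1}{2} \left\{ \hat{c}_j^\dagger \hat{c}_k \hat{c}_k^\dagger \hat{c}_j\, \rho + \rho\, \hat{c}_j^\dagger \hat{c}_k \hat{c}_k^\dagger \hat{c}_j \right\} \right)
\tag{25}
$$

with $\kappa_{jk} > 0$, in which case (24) generalizes to the quantum
master equation

$$\frac{\partial}{\partial t}\rho = -i[H, \rho] + \mathcal{L}\rho. \tag{26}$$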

The dynamical equation (26) represents a generaliza- 
tion to the master equation/stochastic process that de- 
scribes the classical random walk, which is formally re- 
covered in the limit where H = 0. The transitions gener- 
ated by the Hamiltonian part are coherent and give rise 
to quantum superpositions and interference, which lie
at the heart of the quantum parallelism that is exploited 
in quantum computers and in quantum walks. The in- 
coherent transition generated by the Lindblad part can 
be interpreted as the result of spontaneous "quantum 
jumps" between different clip states. 
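To make the master equation (26) concrete, the following sketch integrates it numerically for a hypothetical three-clip network: a percept clip $c_0$ coupled coherently (strength `lam`) to an intermediate clip $c_1$, from which an incoherent jump (rate `kappa`) leads to an actuator clip $c_2$. The network, the parameter values, the choice $\epsilon_j = 0$, and the fourth-order Runge-Kutta integrator are all assumptions made for illustration; only the form of the right-hand side follows (21), (25) and (26).

```python
def zeros(n):
    return [[0j] * n for _ in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def axpy(a, b, s=1.0):
    """Return a + s * b for square matrices."""
    n = len(a)
    return [[a[i][j] + s * b[i][j] for j in range(n)] for i in range(n)]

def dagger(a):
    n = len(a)
    return [[a[j][i].conjugate() for j in range(n)] for i in range(n)]

def ketbra(k, j, n):
    """Matrix of |c_k><c_j|, i.e. the transition c_j -> c_k."""
    m = zeros(n)
    m[k][j] = 1 + 0j
    return m

def rhs(rho, ham, jumps):
    """Right-hand side of (26): -i[H, rho] + L rho, with L as in (25)."""
    comm = axpy(matmul(ham, rho), matmul(rho, ham), -1.0)
    out = axpy(zeros(len(rho)), comm, -1j)
    for kappa, op in jumps:
        op_d = dagger(op)
        n_op = matmul(op_d, op)
        anti = axpy(matmul(n_op, rho), matmul(rho, n_op))
        term = axpy(matmul(matmul(op, rho), op_d), anti, -0.5)
        out = axpy(out, term, kappa)
    return out

def rk4_step(rho, ham, jumps, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = rhs(rho, ham, jumps)
    k2 = rhs(axpy(rho, k1, dt / 2), ham, jumps)
    k3 = rhs(axpy(rho, k2, dt / 2), ham, jumps)
    k4 = rhs(axpy(rho, k3, dt), ham, jumps)
    total = axpy(axpy(k1, k2, 2.0), axpy(k4, k3, 2.0))
    return axpy(rho, total, dt / 6)

def simulate_walk(lam=1.0, kappa=1.0, t_final=20.0, dt=0.01):
    """Evolve rho(0) = |c_0><c_0| and return rho(t_final)."""
    n = 3
    ham = axpy(zeros(n), axpy(ketbra(1, 0, n), ketbra(0, 1, n)), lam)
    jumps = [(kappa, ketbra(2, 1, n))]   # incoherent transition c_1 -> c_2
    rho = ketbra(0, 0, n)
    for _ in range(int(round(t_final / dt))):
        rho = rk4_step(rho, ham, jumps, dt)
    return rho
```

The diagonal elements of the returned matrix are the clip populations. In this toy network the coherent coupling moves the excitation back and forth between $c_0$ and $c_1$, while the incoherent jump transfers it irreversibly to the actuator clip, so the population of $c_2$ approaches one at long times and the trace of $\rho$ stays equal to one.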

Most examples of quantum walks that have been stud- 
ied correspond to walks on undirected graphs. A pos- 
sibility to introduce directed walks is to add incoherent 
transitions generated by (25). The price one has to pay 
with such directed transition is that they introduce de- 
coherence, so in general there will be a balance between 
quantum coherence on one side, and directedness on the 
other side. In combining these elements, one can design 
walks with coherent, bi-directional transitions in certain
regions of the network (or graph), combined with incoherent
transitions that "project" to other regions, or that
exit the clip network. The Hamiltonian used in (21) can
be generalized to so-called composite walks [42] that include
further degrees of freedom associated with a given
transition, which could be used to include the emotion
tags (see Section IV) into the quantum model, as well
as to implement discrete quantum walks using quantum
coins [43].

The clips themselves have a composite structure and
may include remembered percepts $s \in \mathcal{S}$ or actions $a \in \mathcal{A}$,
each of which can be composed of different categories.
This compositional structure is accounted for by a tensor
product in the Hilbert space of the clips. For example,
in case of a percept clip $c = \mu(s)$, the corresponding clip
operators have the form

$$\hat{c}^\dagger = \hat{\mu}^\dagger(s) = \hat{\mu}_1^\dagger(s_1) \otimes \hat{\mu}_2^\dagger(s_2) \otimes \cdots \otimes \hat{\mu}_N^\dagger(s_N) \tag{27}$$

where $\hat{\mu}_i^\dagger$ is the memory operator that excites a percept of
category i (like, for example, color or shape).

A call of episodic memory in this picture involves three 
steps, which also illustrates the embedding of the quan- 
tum walk into the otherwise classical agent architecture: 

• Memory activation. Classical percept $s \in \mathcal{S}$
triggers the excitation of an associated memory
state: $s \mapsto \rho(s) = |\psi(s)\rangle\langle\psi(s)|$. (In the simplest
case, $|\psi(s)\rangle = |s\rangle = \hat{\mu}^\dagger(s)|\mathrm{vac}\rangle$, but $|\psi(s)\rangle$ could also
involve superpositions of several percept states related
to s.)

• Quantum walk through the network of clips, as
described by the quantum master equation (26)
with Hamiltonian (21) and with $|\psi(s)\rangle$ as initial
state.

• Memory output. A classical signal that induces
(real) action is generated by the measurement of
certain memory observables. (In the examples described
in Section V these are the actuator observables
$\hat{\mu}^\dagger(a)\hat{\mu}(a)$, and the probability $p_t(a)$ for an actuator
motion a to be triggered at time t is given by
$p_t(a) = \mathrm{tr}(\hat{\mu}^\dagger(a)\hat{\mu}(a)\rho(t)) = \mathrm{tr}(\hat{\mu}(a)\rho(t)\hat{\mu}^\dagger(a))$, where
$\rho(t)$ is the state of the memory at time t.)

The model described here represents a generalization of the
classical random walk, which can be recovered from (26)
by switching off the coherent interactions. It is clear
that the possibility of creating quantum superpositions
of many different percept states opens the door for potentially
huge speed-ups in exploring memory [43], which
is the subject of an ongoing investigation [37]. Note that
quantum random walk processes similar to (26), with 
engineered quantum many-body interactions, have re- 
cently been realized in the context of dissipation-driven 



quantum simulation with trapped ions [44]. Similarly, 
quantum simulators based on laser-driven atomic gases 
in optical lattices have been proposed [45, 46] and are 
currently being explored in many laboratories. 

The scheme that we have presented can be extended 
in various ways. Instead of a simple quantum walk, one
can also introduce additional quantum computational el- 
ements when calling and processing episodes in memory 
space. A more detailed exposition of these ideas is be- 
yond the scope of this paper and will be given in future 
work. 



VIII. CONCLUSION 

We have introduced the notion of projective simula- 
tion and discussed its potential role for learning in ar- 
tificial agents. We have shown that it allows an agent 
to project itself into fictitious situations, which are self- 
generated by the agent (and its specific memory system) 
and which influence its future actions. Projective simu- 
lation enhances the learning capabilities of an agent and 
introduces an elementary notion of creative action. To 
illustrate the basic concepts, we have worked out simple 
but concrete examples of learning agents and the inter- 
play of simulation and episodic memory (ECM). We have 
programmed a learning agent that uses projective simu- 
lation, studied its behavior and tested its performance in 
the invasion game. The idea of projective simulation is, 
however, more general, and we believe that the scheme, as 
part of a comprehensive embodied approach to artificial 
intelligence, could be implemented in autonomous agents 
or robots with realistic task environments. 

We believe that the "embodied approach" to artificial 
intelligence parallels in some way the recent strong atten- 
tion to the role of physics for the foundations of computer 
science (down to the level of quantum mechanics). In the 
same spirit in which people have studied the ultimate power 
of computers on the basis of physical law [47, 48], we are 
here concerned with the question of the ultimate scope 
of intelligent behavior in embodied agents, taking into 
account the physical basis of this embodiment. To ap- 
proach this question, one first needs to develop a model 
of simulation in agents that is both physically grounded 
and at the same time general in its constitutive concepts 
(i.e., not linked to a specific implementation). We have 
shown that the abstract notion of clips, and of projective 
simulation as a random walk through the space of clips, 
which grows dynamically by the specified rules of clip 
variation and composition, provides a first step towards 
such a general framework. From a physicist's perspective, 
such a random walk can be understood as the propagation 
of excitations of physical degrees of freedom that 
represent the information-carrying quantities. Within 
such a conceptual framework, we can formulate, for the 
first time, a meaningful notion of an embodied quantum 
agent, by extending the model of projective simulation 
to the quantum regime. 

This increases both the learning and the adaptation speed 
of the agent substantially. At the same time, there is no 
notion of a "long-term" or "short-term" memory, which 
could be preferable in other scenarios, since the agent 
learns very fast. 

[56] If the reward situation changes at a certain point, e.g. 
due to a change of strategy of the attacker, the tagging 
of a given clip transition will adapt subsequently. 

[57] A plot similar to Figure 6 is obtained if the attacker 
uses symbols of a single color only, but shows in the first 
period (up to n = 249) only one symbol and in the 
second period (from n = 250 on) only the other symbol, ⇐. 

[58] Of course, one could summarize the entire result of the 
simulation in an effective update rule for the agent's 
policy $p^{(n)}(a|s)$, without referring to the internal rules 
of episodic memory recall. In this sense, we are just 
describing a special type of learning agent. The point is, 
however, that we want to illustrate the idea of projective 
simulation and its flexibility in the specific context of a 
learning agent. 
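
As a minimal illustration of the remark in [58], the following Python sketch estimates an effective policy p(a|s) by repeatedly running a random walk from a percept clip until an actuator clip is reached. The clip names, the transition weights and the function names are hypothetical and chosen only for illustration; emotion tags, reflections and the learning update are deliberately left out, so this is not the agent studied in the paper.

```python
import random
from collections import Counter

# Hypothetical toy clip network: each clip maps to its outgoing
# transitions and their (unnormalized) weights. Names and weights are
# illustrative assumptions, not values taken from the paper.
CLIP_NETWORK = {
    "percept_s": {"memory_c1": 1.0, "act_left": 1.0, "act_right": 2.0},
    "memory_c1": {"act_left": 3.0, "act_right": 1.0},
}
ACTUATOR_CLIPS = {"act_left", "act_right"}


def simulate(start, rng):
    """Random walk from a percept clip until an actuator clip is hit."""
    clip = start
    while clip not in ACTUATOR_CLIPS:
        targets = list(CLIP_NETWORK[clip])
        weights = [CLIP_NETWORK[clip][t] for t in targets]
        # Hop to the next clip with probability proportional to its weight.
        clip = rng.choices(targets, weights=weights, k=1)[0]
    return clip


def effective_policy(start, samples=10000, seed=0):
    """Estimate p(a|s) as the empirical frequency of each actuator clip."""
    rng = random.Random(seed)
    counts = Counter(simulate(start, rng) for _ in range(samples))
    return {a: counts[a] / samples for a in sorted(ACTUATOR_CLIPS)}


policy = effective_policy("percept_s", samples=20000, seed=1)
```

For this toy network the estimate should approach p(act_left|s) = 0.4375 and p(act_right|s) = 0.5625, obtained by summing the direct path and the path through the single memory clip.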

[59] The case in which an actuator clip is excited not only at 
the end of a simulation, but also during it, can be 
considered as well. For simplicity, we here restrict our 
attention to the described sequences F. However, multiple 
excitations of actuator clips within a simulation do occur in the 
context of multiple reflections. 

[60] Note that there is a certain freedom as to which part 
of the sequence the tag should be associated with. The simplest 
choice, which we follow here, is that the tag refers only 
to the states of the initial and the final clip. 

[61] In order that a percept can be perceived at all, it must al- 
ready have a (potential) representation in memory space. 
The architecture of memory reflects what is syntactically 
possible, i.e. what can be perceived a priori. We speak 
of the "activation" of a percept clip once it is hit by a 
real (external) stimulus for the first time. The additional 
quality of a percept that has already been stimulated is 
that, in memory space, the corresponding percept clip is 
not isolated but can be reached from other percepts. 

[62] The network in Figure 12(b) can be seen as a special 
instance of the network of Figure 10 where the intermediate 
memory clips correspond to the previously trained 
(red) percepts, and the percept clip takes the role of the 
yet-to-be-trained (blue) percept. 

[63] In the specific case of Figure 12(c), it is the transition 
from the percept clip @ (red) to the actuator clip Q. 
Details depend on the relative weights of the outgoing 
transitions leaving the loop. 

[64] For reflection times R > 1, the emotion tags associated 
with the transitions $s \to a$ must be taken into account. 
Recall that the emotion tag ☺ (or ☹) associated with the 
transition $s \to a$ is the internal representation of the 
(most recent) reward $\Lambda(s, a) = 1$ (or 0) assigned to the 
sequence (s, a). This is also true when the 
actuator clip corresponding to actuator a is the result 
of a deliberation sequence that leads from the percept clip s 
through D > 0 intermediate clips to the actuator clip a. 

[65] For example, when a child learns a certain ability, say, to 
stand up, for the first time, typically muscles are involved 
that were never before used in synchrony (i.e. in a specific 
coordinated way), which makes it so difficult. At 
the same time the realization of the very possibility often 
comes with a strong feeling of pleasure and surprise. 

[66] In analogy with the previous example from child 
development, as described in the preceding footnote, one can 
imagine that the activation of more complex motions 
remains suppressed until the basic skills have first been 
learnt. 

[67] Alternatively, one can consider the merging of two clips as a 
second-order process that can happen at any time, 
but with probabilities that are proportional to the 
product of the individual excitation probabilities. 
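
The second-order rule mentioned in [67] can be sketched in a few lines of Python. This is only an illustration under the assumption that the excitation probabilities of the individual clips are already known; the proportionality constant `rate`, the function name and the clip names are hypothetical and do not appear in the paper.

```python
from itertools import combinations


def merge_probabilities(excitation, rate=1.0):
    """Return a merging probability for every unordered pair of clips.

    excitation: dict mapping clip name -> excitation probability.
    rate: hypothetical proportionality constant (an assumption).
    """
    return {
        (a, b): rate * excitation[a] * excitation[b]
        for a, b in combinations(sorted(excitation), 2)
    }


# Illustrative values only.
probs = merge_probabilities({"clip_a": 0.5, "clip_b": 0.2, "clip_c": 0.1})
```

With these values, the pair of the two most frequently excited clips receives the largest merging probability, while rarely excited clips are merged correspondingly rarely.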

[68] Actually, the preceding training of the agent on the hor- 
izontal and vertical directions is not strictly necessary, 
in this example, if one assumes that there is an a priori 
connection between the percept clip Q) and the actua- 
tor clips (+, 0) and (0, +). Otherwise, the function of the 
preceding training is to activate those actuator clips for 
the first time and with it new incoming connections. 

[69] There are other situations conceivable where the environ- 
ment is quantum mechanical, and the task of the agent 
is to bring the environment into a certain quantum state. 
Depending on whether the agent employs quantum de- 
grees of freedom itself - in its memory, its sensors, or its 
actuators - one can define a variety of different agents. 



